Re: How long to cache robots.txt

Mike Agostino (mna@infoseek.com)
Thu, 18 Jul 1996 12:04:21 -0700


> From owner-robots@webcrawler.com Thu Jul 18 11:07 PDT 1996
> Date: Thu, 18 Jul 1996 09:36:11 -0500 (CDT)
> From: "Daniel T. Martin" <MARTIND@carleton.edu>
> Subject: Re: How long to cache robots.txt
> To: robots@webcrawler.com
>
> While this has not to my knowledge affected the performance of my server, it
> occurs to me that there may be a bug in the way InfoSeek's robot is caching its
> copies of robots.txt; I have robots.txt mapped to a little script that logs the
> time and any other headers in a separate log file - this seems to show that in
> some cases InfoSeek isn't caching for more than 10 minutes!
>

An explanation:

Infoseek's spider caches a site's robots.txt for the lifetime of a
particular crawl process. Normally, a process fetches several thousand
URLs and then exits, and a single site's URLs are constrained to run
within a single process. However, if, as appears to be the case here, a
particular site spans multiple processes, we will request its
robots.txt more frequently.
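
As a rough illustration (a sketch in modern Python, not our actual
code; the function name and user-agent string are made up), a
per-process cache of that sort looks something like this:

    import urllib.robotparser
    from urllib.parse import urlparse

    # One parsed robots.txt per host, kept for the life of the process.
    _robots_cache = {}

    def allowed(url, agent="ExampleSpider"):
        host = urlparse(url).netloc
        rp = _robots_cache.get(host)
        if rp is None:            # first URL for this host in this process
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()             # one fetch, reused until the process exits
            _robots_cache[host] = rp
        return rp.can_fetch(agent, url)

A site whose URLs end up split across several such processes would see
one robots.txt fetch per process, which matches what Daniel observed.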

This implementation is fairly conservative about caching robots.txt. We
developed this policy after receiving complaints from both ends of the
spectrum. Some webmasters believed robots.txt should always, always,
always be retrieved before *every* access because they want their
changes to be seen instantly; the other extreme wants spider writers to
cache it for up to a month. IMHO, neither extreme is correct. Our
policy may not be the best technical solution, but as a "service
company" we are often obliged to take middle-of-the-road solutions
that will, we hope, bother the fewest people.

Mike

====================================================================
Mike Agostino mna@infoseek.com
InfoSeek Corporation
VP, Pizza Configuration
"I feel myself getting the urge to build an igloo" -- The Dead Milkmen