Re: How long to cache robots.txt for?

Micah A. Williams (micah@sequent.uncfsu.edu)
Tue, 9 Jul 96 22:53:33 EDT


In the words of Aaron Nabil,
>
> Well, I've already touched on the implications of ignoring robots.txt,
> even when you aren't a robot. :(
>
> I also have a real honest-to-goodness robot running. It does obey
> robots.txt. Currently, after transferring it once, it never transfers
> it again. That's one extreme of caching robots.txt.
>
> The other extreme would be no caching at all: re-fetching and testing
> robots.txt before GETting each URL. A little less extreme would be
> doing a GET with If-Modified-Since on robots.txt before each transfer.
>
> My next robot implementation is going to cache robots.txt for a fixed
> period, say 1 week. Does this sound reasonable?

The spider I wrote (Pioneer) refreshes /robots.txt files on each new
execution of the robot. During a run it tries to retrieve a robots.txt
file for each new host it encounters; that policy file, if it exists,
is then considered active for that site until the robot shuts down.
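
In rough Python (hypothetical names; the stdlib urllib.robotparser
stands in here for Pioneer's actual parser), the idea is:

    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    _robots = {}   # host -> parsed policy; lives until the run ends

    def allowed(agent, url):
        # One robots.txt fetch per host per run; after that, every
        # check for that host is answered from memory.
        host = urlsplit(url).netloc
        parser = _robots.get(host)
        if parser is None:
            parser = RobotFileParser("http://" + host + "/robots.txt")
            parser.read()
            _robots[host] = parser
        return parser.can_fetch(agent, url)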

This works fine for me because the longest period of time I've
ever run the robot without interruption is 10 hours.

If you're gonna rev up the spider and then go on vacation, though,
perhaps another method is in order.
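
E.g., a fixed-period cache along the lines Aaron suggested: keep the
per-host table, but re-fetch any entry older than, say, a week. A
hypothetical sketch, with the same caveats as above:

    import time
    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    MAX_AGE = 7 * 24 * 3600   # one week, per Aaron's suggestion

    _cache = {}   # host -> (parsed policy, fetch time)

    def allowed(agent, url):
        # Same per-host table, but stale entries get re-fetched, so a
        # long-running robot eventually picks up policy changes.
        host = urlsplit(url).netloc
        entry = _cache.get(host)
        if entry is None or time.time() - entry[1] > MAX_AGE:
            parser = RobotFileParser("http://" + host + "/robots.txt")
            parser.read()
            entry = (parser, time.time())
            _cache[host] = entry
        return entry[0].can_fetch(agent, url)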

-Micah

-- 
============================================================================
Micah A. Williams | Computer Science | Fayetteville State University	
micah@sequent.uncfsu.edu | http://sequent.uncfsu.edu/~micah/ 
Bjork WebPage: http://sequent.uncfsu.edu/~micah/bjork.html
Though we may not realize it, we all, in some capacity, work for Keyser Soze. 
============================================================================