> The spider I wrote (Pioneer) refreshes /robots.txt files for each new
> execution of the robot. During its run it will try to
...
> This works fine for me because the longest period of time I've
> ever run the robot without interruptions is 10 hours.
Let's consider these points here:
- refreshing the robots.txt cache should not happen so often that it
  noticeably increases the bandwidth the robot consumes. For example, if you
  fetch a document every 5 minutes and do that for 1000 documents,
  refreshing robots.txt before every fetch, even with If-Modified-Since,
  sounds unnecessary (see the sketch after this list)
- still, the cached copy should pick up changes made by the webmaster
  reasonably fast; one day between refreshes could be the maximum, I think
- if you hit the site often, it is more likely that the webmaster would
  like to guide you with a better robots.txt, and he would also like
  it to take effect soon.
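
A minimal sketch of the If-Modified-Since idea mentioned above (the cache
structure and the function name are my own, not taken from Pioneer or any
particular robot):

import time
import urllib.request
from urllib.error import HTTPError
from email.utils import formatdate

# Hypothetical in-memory cache: host -> (robots.txt body, time it was fetched).
_cache = {}

def refresh_robots(host, max_age):
    """Re-fetch http://<host>/robots.txt only if the cached copy is older
    than max_age seconds, using If-Modified-Since to save bandwidth."""
    body, fetched_at = _cache.get(host, (None, 0.0))
    if time.time() - fetched_at < max_age:
        return body  # cached copy is fresh enough, no request at all

    request = urllib.request.Request("http://%s/robots.txt" % host)
    if body is not None:
        # Ask the server to send the file only if it changed since last fetch.
        request.add_header("If-Modified-Since", formatdate(fetched_at, usegmt=True))
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read().decode("latin-1")
    except HTTPError as err:
        if err.code != 304:  # 304 Not Modified -> keep the old copy
            raise
    _cache[host] = (body, time.time())
    return body

The max_age argument is where the refresh policy discussed below plugs in.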
Now some examples of what I think sounds reasonable:
Your robot hits one document a day on the given site. You should refresh
your robots.txt immediately before fetching each document.
Your robot hits a document every hour. Refresh once a day, I think.
Your robot hits a document every two minutes. Refresh every half hour or
hour: in that window you cannot do too much damage with a stale copy, and
the extra robots.txt hits would not annoy anyone.
robots.txt retrievals per day ~ sqrt (hits per day), minimum once a day ??
:-)
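
As a minimal sketch of that rule of thumb (the function name and the sample
numbers are my own, and it comes out a bit more eager than the once-a-day
suggestion for the hourly case above, which only shows how rough the rule is):

import math

SECONDS_PER_DAY = 24 * 60 * 60

def robots_refresh_interval(hits_per_day):
    """Seconds to wait between robots.txt refreshes for one site, following
    'retrievals per day ~ sqrt(hits per day), minimum once a day'."""
    retrievals_per_day = max(1.0, math.sqrt(hits_per_day))
    return SECONDS_PER_DAY / retrievals_per_day

# Sample values:
#   1 hit/day                        -> 1 refresh/day
#   24 hits/day (one per hour)       -> ~5 refreshes/day (about every 5 hours)
#   720 hits/day (one per 2 minutes) -> ~27 refreshes/day (about every 53 minutes)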