Something must be said about these limits:
- Some minimum which one should expect the robot to handle. (I would
say "MUST handle", but of course robots do as they please.)
- What should the robot do when it reaches a limit? Assume Disallow by
  default, or Allow, or somehow depend on the record for the user-agent
  in question, or fall back to the User-Agent: * record (if found
  before the limit) instead, or...? (One possible fallback is sketched
  just after this list.)
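
To make the fallback question concrete, here is a minimal sketch (in
Python, purely for illustration) of a robot that applies a line limit and
assumes Disallow for everything when the limit cuts off the record it
needs. MAX_LINES, the user-agent matching, and the "disallow all" fallback
are my assumptions, not anything the RFC would mandate, and precedence
between a specific agent's record and the '*' record is ignored to keep
it short.

MAX_LINES = 1000   # arbitrary cap this hypothetical robot will parse


def disallow_rules(robots_txt, user_agent="examplebot"):
    """Collect Disallow prefixes for user_agent from robots_txt.

    If MAX_LINES is reached before a record that applies to us has been
    read in full, assume Disallow for everything ("/")."""
    rules, applies, record_done = [], False, False

    for lineno, line in enumerate(robots_txt.splitlines(), 1):
        if lineno > MAX_LINES:
            return rules if record_done else ["/"]   # conservative fallback

        line = line.split("#", 1)[0].strip()         # drop comments
        if not line:                                 # blank line ends a record
            record_done = record_done or applies
            applies = False
            continue

        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent" and not record_done:
            applies = applies or value == "*" or value.lower() in user_agent.lower()
        elif field == "disallow" and applies and value:
            rules.append(value)          # empty Disallow means "no restriction"

    return rules                         # whole file read: use what we found


print(disallow_rules("User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\n"))
# ['/cgi-bin/', '/tmp/']
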
A related point: it might be useful to let robots tell WWW sites that
they had trouble with their robots.txt. E.g. a line
Errors-To: /cgi-bin/robot-message
early in a bad http://www.uio.no/robots.txt might cause the robot to
access
http://www.uio.no/cgi-bin/robot-message ?
error=Too+big+regexp &
robots-txt-line=32 &
URL=/failed/on/this/url
where anything after the '?' is up to the robot, not specified by the
RFC. (It would have to be read by a human anyway, so little would be
gained by some specific format.)
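
For what it's worth, here is an equally rough sketch of the reporting
side: the robot joins the Errors-To path against the robots.txt URL,
appends a query string of its own choosing (the field names below are
just the ones from the example above), and fires off a best-effort GET.
Python and urllib are illustration only, not part of any proposal.

from urllib.parse import urljoin, urlencode
from urllib.request import urlopen


def build_report_url(robots_txt_url, errors_to, error, line, failed_url):
    """Combine the site's Errors-To path with a robot-chosen query string."""
    query = urlencode({
        "error": error,             # e.g. "Too big regexp"
        "robots-txt-line": line,    # offending line in robots.txt
        "URL": failed_url,          # URL the robot gave up on
    })
    return urljoin(robots_txt_url, errors_to) + "?" + query


def send_report(report_url):
    """Best-effort notification; a failure here should never stop the crawl."""
    try:
        urlopen(report_url, timeout=10).close()
    except OSError:
        pass


if __name__ == "__main__":
    url = build_report_url("http://www.uio.no/robots.txt",
                           "/cgi-bin/robot-message",
                           "Too big regexp", 32, "/failed/on/this/url")
    print(url)
    # http://www.uio.no/cgi-bin/robot-message?error=Too+big+regexp
    #   &robots-txt-line=32&URL=%2Ffailed%2Fon%2Fthis%2Furl
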
Regards,
Hallvard