Re: RFC, draft 1

Hallvard B Furuseth (h.b.furuseth@usit.uio.no)
Tue, 10 Dec 1996 23:32:19 +0100 (MET)


> Robots need to be aware that the amount of resources spent on dealing
> with the /robots.txt is a function of the file contents, which is not
> under the control of the robot. To prevent denial-of-service attacks,
> robots are therefore encouraged to place limits on the resources
> spent on processing of /robots.txt.

Something must be said about these limits.

- Some minimum amount of robots.txt data which one can expect the robot
to handle. (I would say "MUST handle", but of course robots do as they
please.)

- What should the robot do when it reaches a limit? Assume Disallow by
default, or Allow, or base the decision on whatever it has seen of the
record for the user-agent in question, or try to follow the
User-Agent: * record (if found before the limit) instead, or...?
(A sketch of one possible answer follows below.)
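
Below, a minimal sketch in Python of one way to answer that question:
read /robots.txt only up to a fixed budget, and if the budget runs out
before a usable record turns up, assume Disallow for everything. The
limits and the fall-back choice are illustrative assumptions, not
anything taken from the draft.

    MAX_CHARS = 64 * 1024   # assumed cap on how much of /robots.txt is parsed
    MAX_LINES = 1000        # assumed cap on how many lines are parsed

    def disallow_rules(text, agent):
        """Return the Disallow prefixes applying to `agent`, or None if a
        resource limit was hit before a matching record was found."""
        truncated = len(text) > MAX_CHARS
        lines = text[:MAX_CHARS].splitlines()
        if len(lines) > MAX_LINES:
            lines, truncated = lines[:MAX_LINES], True

        records = []                       # list of (agents, rules) pairs
        agents, rules = [], []
        for line in lines:
            line = line.split('#', 1)[0].strip()
            if not line:                   # blank line ends a record
                if agents:
                    records.append((agents, rules))
                agents, rules = [], []
                continue
            field, _, value = line.partition(':')
            field, value = field.strip().lower(), value.strip()
            if field == 'user-agent':
                if rules:                  # new record without a blank line
                    records.append((agents, rules))
                    agents, rules = [], []
                agents.append(value.lower())
            elif field == 'disallow' and agents:
                rules.append(value)
        if agents:
            records.append((agents, rules))

        for want in (agent.lower(), '*'):  # prefer the robot's own record
            for record_agents, record_rules in records:
                if want in record_agents:
                    return record_rules
        return None if truncated else []   # [] = no record, nothing disallowed

    # One possible answer to "what then?": if the limit was hit before a
    # usable record turned up, assume everything is disallowed.
    def allowed(path, rules):
        return rules is not None and not any(r and path.startswith(r)
                                             for r in rules)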

A related point: it might be useful to allow robots to tell www sites
that they did not like their robots.txt. E.g.

    Errors-To: /cgi-bin/robot-message

early in a bad http://www.uio.no/robots.txt might cause the robot to
access

    http://www.uio.no/cgi-bin/robot-message?error=Too+big+regexp
        &robots-txt-line=32&URL=/failed/on/this/url

where anything after the '?' is up to the robot, not specified by the
RFC. (It would have to be read by a human anyway, so little would be
gained by some specific format.)
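
For illustration, here is a sketch (again Python, and again only an
assumption about how such a proposal might be used, not anything from
a draft) of the robot's reporting side; the parameter names simply
mirror the example above.

    from urllib.parse import urlencode, urljoin
    from urllib.request import urlopen

    def report_robots_txt_error(site, errors_to, message, line_no, failed_url):
        """Tell the site, via its Errors-To: URL, what the robot choked on."""
        query = urlencode({
            'error': message,            # free-form text, meant for a human
            'robots-txt-line': line_no,  # offending line of /robots.txt
            'URL': failed_url,           # URL the robot gave up on, if any
        })
        report_url = urljoin(site, errors_to) + '?' + query
        try:
            urlopen(report_url, timeout=30).close()  # response is ignored
        except OSError:
            pass                         # reporting is best-effort only

    # e.g. report_robots_txt_error('http://www.uio.no/',
    #          '/cgi-bin/robot-message', 'Too big regexp', 32,
    #          '/failed/on/this/url')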

Regards,

Hallvard