Re: RFC, draft 1

Captain Napalm (spc@armigeron.com)
Fri, 15 Nov 1996 23:15:35 -0500 (EST)


It was thus said that the Great Martijn Koster once stated:
>
>
> Hallvard wrote recently:
>
> > We need official WWW standards to refer to the Robot Exclusion Standard
>
Maybe a better name would be 'Robot Access Standard'. This isn't just for
exclusion anymore.

> I finally sat down and wrote a new specification of the Standard for
> Robot Exclusion (at the expense of reading the list :-)
> My focus is not on new features (although I did add Allow),
> but on providing a more solid specification which addresses concerns of
> ambiguity and completeness.

Any reason for holding off on new features? It seems that we're fairly close
(maybe) to a new standard, and if an RFC is going to be submitted, it might be
better to include the new features that are deemed sorely needed (like
Visit-time: and Request-rate:, plus whatever form Allow: and Disallow: with
regular expressions end up taking).
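
  For the sake of argument, a record in such an extended /robots.txt might
look something like this (the Visit-time: and Request-rate: values and the
exact syntax are only my guess at how it could be written; nothing here is
settled, and I've left out the regular-expression form of Allow:/Disallow:
since that syntax is even less settled):

    User-agent: *
    Visit-time: 0200-0600       # only visit between 02:00 and 06:00 UTC
    Request-rate: 1/30          # at most one request every 30 seconds
    Disallow: /cgi-bin/
    Allow: /cgi-bin/docs/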

  Not to say that the work here isn't good. It is (it certainly clarifies
things), but maybe, if you can hold off for a bit, a better, newer standard
can be made into an RFC.

> I would really appreciate constructive criticism on this document.
> After two days of writing I'm probably glazing over...

I know the feeling.

> Incidentally, I do expect this will make introduction of new features
> much easier too, as diffs will make it a lot easier to spot potential
> problems. (Like, anyone noticed '*' is legal in a URL path? :-)
>
Oh. Uh ... wow. <Johnny Carson>I did not know that</Johnny Carson>.

> 3.4 Expiration
>
> Robots should cache /robots.txt files, but if they do they must
> periodically verify the cached copy is fresh before using its
> contents.
>
  You might want to add just how often a robot should re-verify the cached
copy, or else we'll get robots that re-check before every file they download.
I guess a safe method would be a HEAD request to check the date; you might
want to mention that.
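
  Something along these lines is what I have in mind -- a rough sketch only,
with the host name and the cached date made up for illustration:

    # Rough sketch: re-verify a cached /robots.txt with a HEAD request and
    # compare Last-Modified against the value saved when the copy was cached.
    # The host and the cached date below are invented for illustration.
    import http.client

    cached_last_modified = "Fri, 15 Nov 1996 12:00:00 GMT"  # from the cache

    conn = http.client.HTTPConnection("www.example.com")
    conn.request("HEAD", "/robots.txt")
    resp = conn.getresponse()
    resp.read()                                  # drain the (empty) body
    if resp.getheader("Last-Modified") != cached_last_modified:
        print("cached copy is stale; refetch /robots.txt")
    else:
        print("cached copy is still fresh")
    conn.close()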

> 4.1 Backwards Compatibility
>
> Previous versions of this specification didn't provide the Allow line. The
> introduction of the Allow line causes robots to behave slightly
> differently under either specification:
>
> If a /robots.txt contains an Allow which overrides a later occurring
> Disallow, a robot ignoring Allow lines will not retrieve those
> parts. This is considered acceptable because there is no requirement
> for a robot to access URLs it is allowed to retrieve, and it is safe,
> in that no URLs a Web site administrator wants to Disallow are
> allowed. It is expected this may in fact encourage robots to upgrade
> compliance to the specification in this memo.
>
I guess this means we now have Allow: with 1.0 semantics. Okay, time to
update my extended version of robots.txt then 8-)
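
  A concrete case of the difference, with made-up paths:

    User-agent: *
    Allow: /docs/public/
    Disallow: /docs/

A robot that only knows the old standard ignores the Allow: line and skips
everything under /docs/, including /docs/public/; a robot that implements
this draft may fetch /docs/public/. Either way nothing the administrator
disallowed gets retrieved, so the old behaviour is merely overly cautious,
not unsafe.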

> Robots need to be aware that the amount of resources spent on dealing
> with the /robots.txt is a function of the file contents, which is not
> under the control of the robot. To prevent denial-of-service attacks,
> robots are therefore encouraged to place limits on the resources
> spent on processing of /robots.txt.
>
Could you clarify what you mean by this?
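My guess at the intent (correct me if I'm off base): cap how much of
/robots.txt a robot will read and how many records it will parse, so a huge
or pathological file can't tie the robot up. Something like this, with
arbitrary limits:

    # Guess at the intent: bound the work done on /robots.txt so a huge or
    # pathological file cannot tie the robot up.  The limits and the field
    # handling here are arbitrary, for illustration only.
    MAX_BYTES = 64 * 1024      # read at most 64 KB of /robots.txt
    MAX_RECORDS = 256          # keep at most 256 field: value records

    def parse_robots(raw_bytes):
        rules = []
        for line in raw_bytes[:MAX_BYTES].decode("latin-1").splitlines():
            line = line.split("#", 1)[0].strip()   # drop comments
            if not line or ":" not in line:
                continue
            if len(rules) >= MAX_RECORDS:
                break                              # stop; don't spin forever
            field, value = line.split(":", 1)
            rules.append((field.strip().lower(), value.strip()))
        return rules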
>
> 6. Acknowledgements
>
> The author would like the subscribers to the robots mailing list for
                       ^
                       to thank (I assume you meant this)

> their contributions to this specification.
>
-spc (Would like to see a consensus soon ... )
