Great draft. I have some specific comments below...
> 3.1 Access method
>
> The instructions must be accessible via HTTP [2] from the site that
> the instructions are to be applied to, as a resource of Internet
> Media Type [3] "text/plain" under a standard relative path on the
> server: "/robots.txt".
Works with HTTPS too, or is that implied?
> Specific behaviors for other server responses are not required by
> this specification, though the following behaviours are recommended:
>
> - On server response indicating access restrictions (HTTP Status
> Code 401 or 403) a robot should regard access to the site
> completely restricted.
>
> - If the request attempt resulted in temporary failure, a robot
> should defer visits to the site until such time as the resource
> can be retrieved.
>
> - On server response indicating Redirection (HTTP Status Code 3XX)
> a robot should follow the redirects until a resource can be
> found.
Following redirects really should be required. It would make life much
easier for servers which serve documents for multiple web hosts, and big
servers are often the ones most sensitive to robots. For example, a
server which serves dozens of vanity domains could more easily implement
a per-domain /robots.txt using redirection, like so:
http://www.vanity1.com/robots.txt -> redirect -> /robots/vanity1.txt
http://www.vanity2.com/robots.txt -> redirect -> /robots/vanity2.txt
One of the problems with the robots.txt format is that it doesn't
address servers which serve virtual collections, like the vanity names
above (or software virtual servers). Redirection might be the easier way
to go, rather than modifying the robots.txt format to support
collections.
I suppose the alternative is to require that these folks generate
robots.txt on the fly with a CGI or something, but I don't think that's
as attractive.
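To make the redirect handling concrete, here is a rough sketch (in
Python; the function name, the redirect limit, and the 401/403 handling
folded in from 3.1 are my own choices, not anything in the draft) of a
robot fetching /robots.txt and following 3XX responses:

    import http.client
    from urllib.parse import urlsplit, urljoin

    def fetch_robots_txt(host, max_redirects=5):
        """Fetch /robots.txt for `host`, following up to `max_redirects` 3XX hops."""
        url = "http://" + host + "/robots.txt"
        # (an HTTPS variant would use http.client.HTTPSConnection instead)
        for _ in range(max_redirects):
            parts = urlsplit(url)
            conn = http.client.HTTPConnection(parts.netloc)
            conn.request("GET", parts.path or "/robots.txt")
            resp = conn.getresponse()
            if 300 <= resp.status < 400:
                location = resp.getheader("Location")
                if not location:
                    return None
                url = urljoin(url, location)   # Location may be relative
                conn.close()
                continue
            if resp.status in (401, 403):
                return "Disallow: /"           # treat the whole site as restricted
            if resp.status == 200:
                return resp.read().decode("latin-1")
            return None                        # 404 etc.: no instructions found
        return None                            # gave up: too many redirects

With something like that on the robot side, the vanity1/vanity2 setup
above needs nothing more than one redirect rule per host on the server.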
>
> 3.2.2 The Allow and Disallow lines
>
> These lines indicate whether accessing a URL that matches the
> corresponding path is allowed or disallowed. Note that these
> instructions apply to any HTTP method on a URL.
>
> To evaluate if a URL is allowed, a robot must attempt to match
> the paths in Allow and Disallow lines against the URL, in the order
> they occur in the record. The first match found is used.
>
> The matching process compares every octet in the path portion of
> the URL and the path from the record. If a %xx encoded octet is
> encountered it is unencoded prior to comparison, unless it is the
> "/" character, which has special meaning in a path. The match
> evaluates positively if and only if the end of the path from the
> record is reached before a difference in octets is encountered.
>
> This table illustrates some examples:
>
> Record Path          URL path             Matches
> /tmp                 /tmp                 yes
> /tmp                 /tmp.html            yes
> /tmp                 /tmp/a.html          yes
> /tmp/                /tmp                 no
> /tmp/                /tmp/                yes
> /tmp/                /tmp/a.html          yes
> /a%3cd.html          /a%3cd.html          yes
> /a%3Cd.html          /a%3cd.html          yes
> /a%3cd.html          /a%3Cd.html          yes
> /a%3cd.html          /a%3cd.html          yes
> /a%2fb.html          /a%2fb.html          yes
> /a%2fb.html          /a%2Fb.html          yes
> /a%2fb.html          /a/b.html            no
> /%7ejoe/index.html   /~joe/index.html     yes
> /~joe/index.html     /%7Ejoe/index.html   yes
This is very helpful. Is the %2f thing valid for URLs?
Should double check with the URL spec. I think that '/'
has special meaning and isn't equivalent to %2f. I'm
not sure that the /a%2fb == /a/b is correct.
Excellent. The addition of Allow is much needed.
It makes the format much more useful.
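For what it's worth, here is a small sketch of how I read the matching
rules in 3.2.2 (Python; the helper names are mine, and the handling of
malformed %-escapes is a guess the draft doesn't cover):

    def decode_for_match(path):
        """Decode %xx octets, keeping "/" (%2F) encoded since it is special in paths."""
        out, i = [], 0
        while i < len(path):
            if path[i] == "%" and i + 2 < len(path):
                try:
                    octet = int(path[i + 1:i + 3], 16)
                except ValueError:
                    octet = None                 # malformed escape: leave it as-is
                if octet is not None:
                    out.append("%2f" if octet == 0x2F else chr(octet))
                    i += 3
                    continue
            out.append(path[i])
            i += 1
        return "".join(out)

    def path_matches(record_path, url_path):
        """Match succeeds iff the record path is exhausted before any octet differs."""
        return decode_for_match(url_path).startswith(decode_for_match(record_path))

    def is_allowed(rules, url_path):
        """`rules` is an ordered list of ("Allow"|"Disallow", path); first match wins."""
        for verb, record_path in rules:
            if path_matches(record_path, url_path):
                return verb == "Allow"
        return True                              # no rule matched: access is allowed

As far as I can tell this reproduces every row of the table above,
including the /tmp/ vs. /tmp and %7e vs. ~ cases.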
On a related topic, my experience with robots (Harvest and Netscape)
has shown that people want very powerful ways to express filtering
rules, since Web content and servers are getting more and more complex.
Our robot uses a very general mechanism for defining filtering rules
based on a variety of information sources other than the URL (e.g.,
content-type, protocol, host:port, actual file data, etc.). Refer to
this URL for the full details on how our robots define their filtering
rules:
http://developer.netscape.com/library/documentation/catguide/index.htm
My point is that in the future the format should be able to specify
Allow/Disallow rules based on information other than the URL;
Content-Type or Content-Language are useful ones, for example. Here's an
example of what I mean, in a hacked syntax:
User-Agent: Some-French-Robot
Allow[Content-Language]: french
Disallow[Content-Language]: .*
User-Agent: *
Disallow[Content-Type]: image/jpeg
Allow[Content-Type]: *
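To be clear, nothing like this is in the draft; but as a thought
experiment, a robot could only evaluate such field-based rules once it
has the response headers in hand (from a HEAD request, say), roughly
like this (names and regex semantics are invented for illustration):

    import re

    def is_allowed_by_headers(rules, headers):
        """
        `rules` is an ordered list of (verb, field, pattern) taken from the
        hacked syntax above, e.g. ("Allow", "Content-Language", "french").
        The first rule whose pattern matches the named header decides.
        """
        for verb, field, pattern in rules:
            value = headers.get(field, "")
            if re.search(pattern, value, re.IGNORECASE):
                return verb == "Allow"
        return True                              # no rule matched

    french_robot_rules = [
        ("Allow", "Content-Language", "french"),
        ("Disallow", "Content-Language", ".*"),
    ]
    print(is_allowed_by_headers(french_robot_rules, {"Content-Language": "french"}))  # True
    print(is_allowed_by_headers(french_robot_rules, {"Content-Language": "en"}))      # False

The obvious downside is that the robot has to touch the resource before
it can decide, which is exactly what URL-based rules avoid.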
The Harvest Gatherer also allows some more flexible rule definitions
based on protocols, hosts, etc. Details on that are here:
>
> 3.4 Expiration
>
> Robots should cache /robots.txt files, but if they do they must
> periodically verify the cached copy is fresh before using its
> contents.
>
> Standard HTTP cache-control mechanisms can be used by both origin
> server and robots to influence the caching of the /robots.txt file.
> Specifically robots should take note of Expires header set by the
> origin server.
>
> If no cache-control directives are present robots should default to
> an expiry of 7 days.
It might be very tough for content providers to use the "standard HTTP
cache-control" mechanisms to specify Expires headers, since robots.txt
uses the text/plain type, not HTML. Typically you would use HTML to do
this:
<META HTTP-EQUIV="Expires" CONTENT="blah">
or whatever, and many Web servers have poor support for expiration in
HTTP. So I'd suggest explicitly adding an Expiration field to the
robots.txt format, using the HTTP date format, IMHO.
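Either way, the robot-side logic is simple. A sketch (Python; treating
a missing or unparseable Expires date as absent is my assumption, and
fetch_time is assumed to be timezone-aware):

    from datetime import datetime, timedelta, timezone
    from email.utils import parsedate_to_datetime

    DEFAULT_TTL = timedelta(days=7)              # the draft's default expiry

    def robots_txt_expiry(fetch_time, expires_header):
        """Return when a cached copy goes stale: HTTP Expires if usable, else 7 days."""
        if expires_header:
            try:
                return parsedate_to_datetime(expires_header)
            except (TypeError, ValueError):
                pass                             # unparseable date: fall back to default
        return fetch_time + DEFAULT_TTL

    def cache_is_fresh(fetch_time, expires_header, now=None):
        now = now or datetime.now(timezone.utc)
        return now < robots_txt_expiry(fetch_time, expires_header)

An in-file Expiration field, as suggested above, would simply be
consulted ahead of (or instead of) the HTTP header here.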
>
> 4. Notes for Implementors
>
> 4.1 Backwards Compatibility
>
> Previous versions of this specification didn't provide the Allow line. The
> introduction of the Allow line causes robots to behave slightly
> differently under either specification:
>
> If a /robots.txt contains an Allow which overrides a later occurring
> Disallow, a robot ignoring Allow lines will not retrieve those
> parts. This is considered acceptable because there is no requirement
> for a robot to access URLs it is allowed to retrieve, and it is safe,
> in that no URLs a Web site administrator wants to Disallow will be
> allowed. It is expected this may in fact encourage robots to upgrade
> compliance to the specification in this memo.
I would like to see version information in the file format which
identifies the version of the format used. It could be done via a
comment on the first line, like so:
# robots.txt-Version: 2.0
or something. Or, through a version tag:
Version: 2.0
I think it would help with future backward compatibility problems, and
with robots.txt files which are programmatically generated, IMHO.
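A robot could pick up either spelling cheaply. A rough sketch (the
field name and the "1.0" fallback are part of the proposal in this
message, not the draft):

    import re

    VERSION_LINE = re.compile(
        r"^(?:#\s*robots\.txt-Version:|Version:)\s*(\d+(?:\.\d+)*)\s*$",
        re.IGNORECASE)

    def robots_txt_version(text, default="1.0"):
        """Return the declared format version, or `default` when none is present."""
        for line in text.splitlines():
            m = VERSION_LINE.match(line.strip())
            if m:
                return m.group(1)
        return default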
>
> 4.2 Interoperability
>
> Implementors should pay particular attention to the robustness in
> parsing of the /robots.txt file. Web site administrators who are not
> aware of the /robots.txt mechanisms often notice repeated failing
> requests for it in their log files, and react by putting up pages
> asking "What are you looking for?".
>
> As the majority of /robots.txt files are created with platform-
> specific text editors, robots should be liberal in accepting files
> with different end-of-line conventions, specifically CR and LF in
> addition to CRLF.
This must change to encourage any kind of real deployment of robots.txt
on servers: robots.txt must be programmatically generated by Web server
administration tools or something, and the format should encourage such
things. Perhaps we could put out some robots.txt authoring code to help
the admin-tools guys...
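For instance, a trivial authoring helper of the kind I mean (the
function and its API are invented for illustration) would let an admin
tool build records programmatically and always emit well-formed output:

    def render_robots_txt(records):
        """`records` maps a User-agent string to an ordered list of
        ("Allow"|"Disallow", path) rules."""
        lines = ["# Generated file - edit through the server admin tool"]
        for agent, rules in records.items():
            lines.append("User-agent: %s" % agent)
            for verb, path in rules:
                lines.append("%s: %s" % (verb, path))
            lines.append("")                     # blank line ends the record
        return "\r\n".join(lines) + "\r\n"       # emit CRLF; readers accept CR/LF too

    print(render_robots_txt({"*": [("Disallow", "/tmp/"), ("Allow", "/")]}))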
>
> 5. Security Considerations
>
> There are a few risks in the method described here, which may affect
> either origin server or robot.
>
Using certificates to verify the authenticity of a /robots.txt sounds
very valuable.
> Web site administrators must realise this method is voluntary, and
> is not sufficient to guarantee some robots will not visit restricted
> parts of the URL space. Failure to use proper authentication or other
> restriction may result in exposure of restricted information. It is even
> possible that the occurrence of paths in the /robots.txt file may
> expose the existence of resources not otherwise linked to on the
> site, which may aid people guessing for URLs.
Perhaps mention the channels that Web site administrators can use to
enforce /robots.txt honoring: checking the 'From' field to get an email
address for the robot runner, the 'User-Agent' field to get the
identity of the robot, and the 'Referer' field to find out more
information about that robot. Maybe even mention the robots mailing
list as a place to report out-of-control robots, etc.
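As a concrete example of the log-checking side (assuming the common
combined log format, with the agent string in the last quoted field;
the regex and names here are mine, not anything in the draft):

    import re

    COMBINED_LOG = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    def suspicious_requests(log_lines, disallowed_prefixes):
        """Yield (client, path, user_agent) for hits inside disallowed paths."""
        for line in log_lines:
            m = COMBINED_LOG.match(line)
            if not m:
                continue
            client, path, agent = m.groups()
            if any(path.startswith(p) for p in disallowed_prefixes):
                yield client, path, agent

Anything that shows up here with a robot-looking User-Agent is a
candidate for a polite complaint to whatever contact address the robot
advertises, or a note to the mailing list.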
-Darren