Re: Robot Exclusion Standard Revisited (LONG)

Martijn Koster (m.koster@webcrawler.com)
Tue, 4 Jun 1996 09:43:41 -0700


At 11:30 AM 5/30/96, Charles P. Kollar wrote:

>The following paper was presented at the recent spidering workshop.
>Its initial intent was to address some of the perceived ambiguities
>in the Robot Exclusion Standard. Comments, or suggestions (public or
>private) are welcome!
>
>http://www.kollar.com/robots.html

Amazing what movement a workshop can create. For ages there was little
comment, and all of a sudden there are three or four papers/movements
to extend the Standard for Robots Exclusion. :-)

I finally found time to give your paper the attention it deserves,
and am including comments/thoughts below.

> Clearly the choice of an appropriate agent id string,
> by the person writing the /robots.txt file, is
> something of an art.

I'd go as far as saying it's a lack of documentation by robot authors :-)
It might help if the List of Active Robots included a description
of what strings a particular robot looks for. Can people send me
email with those strings?

> we found the actual description of the standard ... to be rather confusing
> and ambiguous.

Well, part of that is the problem with trying to write something that
is both sufficient for implementation, yet understandable enough for
fresh webmasters out there. Not everyone wants to handle long documents,
protocol specification speak, or BNF.

It might make sense for these purposes to be split...

1> [ # comment string NL ]*
2> User-Agent: [ [ WS ]+ agent id ]+ [ [ WS ]* # comment string ]? NL
3> [ # comment string NL ]*
4> Disallow: [ [ WS ]+ path root ]* [ [ WS ]* # comment string ]? NL
5> [
6> # comment string NL
7> |
8> Disallow: [ [ WS ]+ path root ]* [ [ WS ]* # comment string ]? NL
9> ]*
10> [ NL ]+

The standard says:

| Comments can be included in file using UNIX bourne shell
| conventions: the '#' character is used to
| indicate that preceding space (if any) and the remainder of
| the line up to the line termination is discarded.
| Lines containing only a comment are discarded completely,
| and therefore do not indicate a record boundary.

Which allows more than your syntax above: there should be a [WS]*
in front of the "comment string" lines. I'm also surprised to see
NL in line 6, yet [NL]* in line 9; this looks inconsistent and can
easily lead to incorrect parsing.
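
To make the comment handling concrete, here is a minimal Python sketch
(my own illustration, not from either document) of splitting a
/robots.txt file into records under the Standard's wording: '#' and any
preceding whitespace are discarded, comment-only lines do not mark a
record boundary, and a trailing record is still returned at end of file.

def strip_comment(line):
    # '#' and any whitespace immediately before it are discarded
    i = line.find('#')
    if i == -1:
        return line.rstrip('\r\n')
    return line[:i].rstrip(' \t')

def records(lines):
    record = []
    for line in lines:
        stripped = strip_comment(line)
        if stripped.strip() == '':
            if '#' in line:
                continue            # comment-only line: not a boundary
            if record:              # blank line terminates the record
                yield record
                record = []
        else:
            record.append(stripped)
    if record:                      # EOF also terminates the last record
        yield record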

>Represents one of the following: [NL,CR,or CRNL]. To adhere to the
>Canonicalization and Text Defaults of the HTTP standard.

There is a grammatical problem with that sentence.

> path root
> This consists of any number of printable characters that are not
> included in WS, and NL.

Well no, they are restricted to be the abs_path (RFC 1808) part of
an http URL, which is a smaller set than printable characters.

Hmmm, to be honest I'm not sure what the correct wording should be;
I'd refer to RFC 1738 and RFC 1808, as we are discussing an http URL
outside the context of an HTTP server. The HTTP spec RFC 1945 takes
more liberty:

| For definitive information on URL syntax and semantics, see RFC
| 1738[4] and RFC 1808[11]. The BNF above includes national characters
| not allowed in valid URLs as specified by RFC 1738, since HTTP servers
| are not restricted in the set of unreserved characters allowed to
| represent the rel_path part of addresses, and HTTP proxies may receive
| requests for URIs not defined by RFC 1738.

But I don't think this applies to us, as we're not an HTTP server, and
not a proxy.

> The case of this string should be considered
> significant (as it is in the Unix file system).

The case should indeed be considered significant, but that is because URLs
are case-sensitive, not because of some file system.

There is a broader omission in the Standard:
the lack of a proper definition of the path root comparison, which should
refer to (or replicate) RFC 1945 Section 3.2.2, or the better description in
draft-ietf-v11-04a. It goes beyond a string compare, as you need to
handle %-encoded octets.
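
For what such a comparison could look like, here is a rough Python sketch
(my own, not taken from the Standard or the paper): decode %xx escapes on
both sides into octets and then do a byte-wise prefix match. How reserved
characters such as %2F should be treated is deliberately left open here.

import urllib.parse

def path_matches(disallow_root, request_path):
    # compare octets, not characters, so %7E and ~ agree where they should
    a = urllib.parse.unquote_to_bytes(disallow_root)
    b = urllib.parse.unquote_to_bytes(request_path)
    return b.startswith(a)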

>A comment line begins with the '#' character in the first position
>of the line.

No, see above.

>Discussion
>...
>Any number of agent id(s) can be placed on the User-Agent line so
>long as they are separated by white space (WS), but the User-Agent
>line must have at least one agent id. If multiple agent id(s) are
>placed on the line then the disallow path root(s)
>described by the record are to apply to all agents listed.

Nope, you've misunderstood that, probably because the word "field" isn't
properly defined in the Standard (I tend to use it the way RFC 822 uses
field-name; sigh, I've been around Mail/X.500 systems too long :-).

The Standard says:

> If more than one User-agent field is present the record
> describes an identical access policy for more
> than one robot. At least one field needs to be present
> per record.

I.e., if you want your record to apply to multiple robots
you need to repeat the User-agent line. While I'm not all that much
against your interpretation, we should discuss it as an extension,
not a clarification.
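
To illustrate that reading, here is a small Python sketch (mine; it
assumes a record is held as a list of (field-name, value) pairs): repeated
User-agent lines in one record make the same Disallow set apply to each
robot named, '*' acts as the catch-all, and agent matching is a simple
case-insensitive substring test, as the Standard recommends.

def applicable_disallows(records, robot_name):
    name = robot_name.lower()
    fallback = []
    for record in records:
        agents = [v.lower() for f, v in record if f == 'user-agent']
        disallows = [v for f, v in record if f == 'disallow']
        if any(a != '*' and a in name for a in agents):
            return disallows        # record names this robot explicitly
        if '*' in agents:
            fallback = disallows    # catch-all record, kept as a default
    return fallback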

> Any number of path root(s) can be placed on the Disallow line so long
> as they are separated by white space (WS).

You made that up too :-) I know people do it in practice, but nowhere
in the Standard does it say you can. Yes, we should handle it.

>A Disallow line may also specify no path roots.

You omit the associated semantics: all URLs are then fair game.

The standard doesn't explicitly say how to resolve something like:

> Disallow:
> Disallow: /

which should be fixed.
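
One conservative way to handle both points above (the whitespace-separated
path roots people use in practice, and the empty-value ambiguity) might
look like the following Python sketch. It is my own resolution policy,
not the Standard's ruling: split each Disallow value on whitespace, treat
an empty value as disallowing nothing, and let any non-empty matching root
block the URL, so the pair of lines above still keeps robots out. It
reuses the path_matches sketch from earlier.

def allowed(disallow_values, request_path):
    roots = []
    for value in disallow_values:
        roots.extend(value.split())     # tolerate several roots per line
    if not roots:
        return True                     # "Disallow:" alone blocks nothing
    return not any(path_matches(r, request_path) for r in roots)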

>There is no limit to the number of Disallow lines that a record may have.

What do you guys do in practice? If the number of Disallow lines exceeds a
threshold we disallow the whole site.

>The entire record is terminated by an empty line.

Or EOF

> If any record does not adhere to these rules then that record should be
> ignored by the robot.

Hmmm, that's a bit harsh. Remember the Robustness Principle (RFC 1123):

"Be liberal in what you accept, and conservative in what you send"

If someone spells "Disalow", I would rather you assumed they meant Disallow,
or at the very least dropped only that line, not the whole record. The standard
should not dictate dropping the record, but leave the behaviour undefined.
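
In that liberal spirit, a parser might recognise near-miss field names
instead of throwing anything away; the tolerated spellings below are only
my illustration, not anything either document prescribes.

FIELD_ALIASES = {
    'user-agent': 'user-agent', 'useragent': 'user-agent',
    'disallow': 'disallow', 'disalow': 'disallow', 'dissallow': 'disallow',
}

def parse_field(line):
    if ':' not in line:
        return None                     # not a field line: skip it alone
    name, value = line.split(':', 1)
    canonical = FIELD_ALIASES.get(name.strip().lower())
    if canonical is None:
        return None                     # unknown field: drop the line only
    return canonical, value.strip()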

> The entirety of each path root of the record will be compared with the
> lowest order or leftmost portion of the complete path of the absolute url.
> Case will be significant in the comparison. An affirmative comparison is
> one in which each and every character of a path root exactly matches the
> corresponding character in the complete path of the absolute url.

No, see above. And they're octets.

> Questions from the robots mailing list

> We have arbitrarily established the following rule. A robot will use the
> cached /robots.txt file if it has been retrieved within the current day UTC.
> This rule is easy to compute, and says that on average a /robots.txt file
> will not be retrieved more than once from a site in a 24 hr period.

I always think 1-7 days is sensible.
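
A freshness check along those lines is trivial; the 7-day figure below is
just the upper end of the range I mention, not a requirement.

import time

MAX_AGE = 7 * 24 * 3600                 # anywhere in the 1-7 day range

def cache_is_fresh(fetched_at, now=None):
    now = time.time() if now is None else now
    return (now - fetched_at) < MAX_AGE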

>Therefore, we feel that it is reasonable for a spider to place some limit
>on the information that it caches. When we run into situations where the
>file exceeds the threshold of information that we are willing to keep,
>we do not cache the file.

Interesting. Does that mean you retrieve it prior to every request to a
remote host? Do you apply all the rules? What if you run out of memory?
I'd recommend normalising these huge files to a "Disallow: /" for safety.
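
The normalisation I have in mind is as simple as this sketch (the
threshold is arbitrary, and the rest of the caching machinery is assumed):

MAX_DISALLOWS = 200

def cap_record(disallows):
    if len(disallows) > MAX_DISALLOWS:
        return ['/']                    # err on the side of not visiting
    return disallows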

> [cgi] The consensus appears to be that this is covered already if the
>user places data in the Disallow directories.

The word "directory" here is confusing.

> In general limits on the rate of file retreval

s/retreval/retrieval/

> Marking content as static or dynamic
>
> The creator of a document may wish to tell a spider or a user
> bookmarking a page whether the content of the url is designed
> to change from access to access (dynamic). In this manner the spider
> would not save the url, and the web browser may or may not bookmark it.
> While this information could be conveyed in the /robots.txt file,
> it is possible that the user may not have the ability to modify it.

Hmm, interesting definition of "dynamic"; I always use that term for
on-the-fly generated pages, and use "short-lived" for the above use.
I feel this can be covered by either the HTTP Expires header or the
<META NAME="ROBOTS" CONTENT="NOINDEX"> tag we discussed during the workshop.
I believe the extra semantic differentiation for this case is not really
useful.
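
For what it's worth, honouring that tag when deciding whether to index a
fetched page can be done with something as crude as the sketch below; it
is my illustration only, uses a regular expression rather than a real
HTML parser, and assumes the NAME attribute precedes CONTENT.

import re

ROBOTS_META = re.compile(
    r'<meta\s+name=["\']?robots["\']?\s+content=["\']?([^"\'>]+)',
    re.IGNORECASE)

def may_index(html):
    m = ROBOTS_META.search(html)
    if not m:
        return True                     # no directive: index as usual
    directives = [d.strip().lower() for d in m.group(1).split(',')]
    return 'noindex' not in directives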

> Guidelines for spidering a page
>
> The creator of a page may wish to have that page removed from consideration
> by spiders. While (again) this information could be conveyed in the
> /robots.txt file, it is possible that the user may not have the ability to
> modify it.
>
> The following meta tag would be used to tell all spiders not to index
> this page. The default "SPIDER BY" of a document would be all.
>
> <META NAME="SPIDER BY" CONTENT="ALL">
> <META NAME="SPIDER BY" CONTENT="NONE">

This has been superseded by the ROBOTS meta tag. [I expect Fuzzy will post
the minutes sometime this week]

> Referencing a page that should be used instead of this one
>
> There are cases where the url is generated dynamically as a method of
> tracking a user, or the content of the URL is generated dynamically
> (by a CGI script), or the url is one of a site of mirrors and
> Canonicalization is an issue. In these cases it would be convenient
> for the document to give all spiders a reference to another document.
>
>The following meta tag would be used to tell all spiders not to index
>this url, but to index the CONTENT url instead.
>
> <META NAME="URL" CONTENT="absolute url">

To clarify, I take it you mean drop the current document and retrieve and
index the other one, rather than indexing the current content under that
URL (which would open you up to spoofing).

We couldn't reach 5-minute consensus on that one during the workshop.
I believe such a mechanism would be quite useful. However, I think the
syntax above is not the best way to proceed; the name "URL" is too broad
for the purpose, and it may be better to use the LINK tag, which was
invented to define relationships between documents:

<LINK NAME="INDEXINSTEAD" CONTENT="absolute url">

My other favourite proposed extension is a PleaseVisit field, with either
an absolute or relative URL.
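
If either form were adopted, the robot-side handling I would expect is no
more than this hypothetical sketch: forget the current document and queue
the referenced absolute URL for its own fetch, so content is never indexed
under a URL it did not come from. The queue (a list) and the index (a set
of URLs) are assumptions of mine.

def handle_index_instead(current_url, hint_url, queue, indexed):
    # hint_url is the absolute URL named by the proposed tag/field
    if hint_url and hint_url != current_url:
        indexed.discard(current_url)    # do not index the current content
        queue.append(hint_url)          # fetch and index the target itself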

Conclusion:
- Yes, the current spec is not rigid enough
- Your interpretation doesn't reflect the standard (or my intention anyway)
everywhere, which highlights the first problem :-)
- I have some issues with your definition too :-)
- We seem to agree on desired additional functionality
- We should keep the momentum going and nail down a better document,
if not two: a spec and a webmaster's guide.

Comments?

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html