RE: RFC, draft 1

Keiji Kanazawa (keijik@microsoft.com)
Sat, 16 Nov 1996 09:50:12 -0800


Unless I'm mistaken, this proposal is incomplete. It does not state the
behavior when there is no match. As it is written, it gives the
impression (if anything) that no match implies no access.

I suggest that the intention was that no match implies that the agent is
allowed.

Otherwise, a terminal "Allow: /" is needed to allow crawling of the
whole site (or the rest of the site, I should say) by default. This is
likely to cause a lot of grief, since it would be all too easy to forget
to add that terminal "Allow: /", especially since the implicit default
is currently just that.

You could argue that disallowing by default is good manners, but you
can easily imagine the opposite: a site starts complaining to you that
their content doesn't come up in searches, and the reason turns out to
be that they forgot to put the last "Allow: /" in the access list.
After the 20th site where you manually check their robots.txt and write
back to tell them they forgot the terminal "Allow: /", you will
definitely wish that the default had been set to allow all.
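
To make the difference concrete, here is a rough sketch (Python, with
made-up names and a simplified prefix match) of the evaluation loop a
robot ends up doing; the whole question is what gets returned when the
loop falls through without a match:

    def allowed(url_path, rules, default=True):
        # rules: ordered (directive, path) pairs from the record that
        # applies to this robot, for example
        #   [("Allow", "/tmp/ok.html"), ("Disallow", "/tmp")]
        for directive, rule_path in rules:
            if url_path.startswith(rule_path):  # simplified prefix match
                return directive == "Allow"     # first match wins
        return default                          # <-- the unspecified case

With default=True the existing behaviour is preserved; with
default=False every record needs that terminal "Allow: /" or the rest
of the site is off limits.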

>----------
>From: m.koster@webcrawler.com[SMTP:m.koster@webcrawler.com]
>Sent: Friday, November 15, 1996 5:02 PM
>To: robots@webcrawler.com
>Subject: RFC, draft 1
>
>
>Hallvard wrote recently:
>
>> We need official WWW standards to refer to the Robot Exclusion Standard
>
>I finally sat down and wrote a new specification of the Standard for
>Robot Exclusion (at the expense of reading the list :-)
>My focus is not on new features (although I did add Allow),
>but on providing a more solid specification which addresses concerns of
>ambiguity and completeness. As a result the spec itself looks complex,
>though the format itself isn't; this is _not_ a user guide: a user
>guide would look completely different, and far simpler.
>
>FWIW, it also conforms to RFC format, so that it can be submitted to the
>RFC editor once we're happy with it; though we're not a WG, I would like
>to see some rough consensus first.
>
>In other words, this is more what I _should_ have written in 1994.
>
>I would really appreciate constructive criticism on this document.
>After two days of writing I'm probably glazing over...
>It is both appended and available on
>http://info.webcrawler.com/mak/projects/robots/robots.html.
>
>Incidentally, I do expect this will make introduction of new features
>much easier too, as diffs will make it a lot easier to spot potential
>problems. (Like, anyone noticed '*' is legal in a URL path? :-)
>
>Regards,
>
>-- Martijn Koster
>
>Draft version 1, Fri Nov 15 14:46:55 PST 1996
>
>
>
>
>
>Network Working Group                                         M. Koster
>Request for Comments: NNNN                                    WebCrawler
>Category: Informational                                    November 1996
>
>
>
> A Method for Robots Exclusion
>
>Status of this Memo
>
> This memo provides information for the Internet community. This memo
> does not specify an Internet standard of any kind. Distribution of
> this memo is unlimited.
>
>Table of Contents
>
> 1. Abstract . . . . . . . . . . . . . . . . . . . . . . . . . 1
> 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . 1
> 3. The Specification . . . . . . . . . . . . . . . . . . . 1
> 3.1 Access method . . . . . . . . . . . . . . . . . . . . . . . 1
> 3.2 File Format Description . . . . . . . . . . . . . . . . . . 1
> 3.2.1 The User-agent line . . . . . . . . . . . . . . . . . . . . 1
> 3.2.2 The Allow and Disallow lines . . . . . . . . . . . . . . . 1
> 3.3 File Format Syntax . . . . . . . . . . . . . . . . . . . . 1
> 3.4 Expiration . . . . . . . . . . . . . . . . . . . . . . . . 1
> 4. Notes for Implementors . . . . . . . . . . . . . . . . . 1
> 4.1 Backwards Compatibility . . . . . . . . . . . . . . . . . . 1
> 4.2 Interoperability . . . . . . . . . . . . . . . . . . . . 1
> 5. Security Considerations . . . . . . . . . . . . . . . . . . 1
> 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . 1
> 7. References . . . . . . . . . . . . . . . . . . . . . . . 1
> 8. Author's Address . . . . . . . . . . . . . . . . . . . . . 1
>
> [XXX once completed, break into pages and update page-numbers in table
> of contents]
>
>1. Abstract
>
> This memo defines a method for administrators of sites on the World-
> Wide Web to give instructions to visiting Web robots, most
> importantly what areas of the site are to be avoided.
>
> This document provides a more rigid specification of the Standard
> for Robots Exclusion [1], which has been in widespread use by the
> Web community since 1994. The specification here is intended to
> replace said standard, while aiming to be backwards compatible.
>
>
>2. Introduction
>
> Web Robots (also called "Wanderers" or "Spiders") are Web client
> programs that automatically traverse the Web's hypertext structure
> by retrieving a document, and recursively retrieving all documents
> that are referenced.
>
> Note that "recursively" here doesn't limit the definition to any
> specific traversal algorithm; even if a robot applies some heuristic
> to the selection and order of documents to visit and spreads out
> requests over a long period of time, it still qualifies as a
> robot.
>
> Robots are often used for maintenance and indexing purposes, by
> people other than the administrators of the site being visited. In
> some cases such visits may have undesirable effects which the
> administrators would like to prevent, such as indexing of an
> unannounced site, traversal of parts of the site which require vast
> resources of the server, recursive traversal of an infinite URL
> space, etc.
>
> The technique specified in this memo allows Web site administrators
> to indicate to visiting robots which parts of the site should be
> avoided. It is solely up to the visiting robot to consult this
> information and act accordingly. Blocking parts of the Web site
> regardless of a robot's compliance with this method is outside
> the scope of this memo.
>
>
>3. The Specification
>
> This memo specifies a format for encoding instructions to visiting
> robots, and specifies an access method to retrieve these
> instructions. Robots must retrieve these instructions before visiting
> other URLs on the site, and use the instructions to determine if
> other URLs on the site can be accessed.
>
>3.1 Access method
>
> The instructions must be accessible via HTTP [2] from the site that
> the instructions are to be applied to, as a resource of Internet
> Media Type [3] "text/plain" under a standard relative path on the
> server: "/robots.txt".
>
> For convenience we will refer to this resource as the "/robots.txt
> file", though the resource need in fact not originate from a file-
> system.
>
> Some examples of site URLs [4] and the URLs of the corresponding
> "/robots.txt" resources:
>
> http://www.foo.com/welcome.html http://www.foo.com/robots.txt
>
> http://www.bar.com:8001/ http://www.bar.com:8001/robots.txt
>
> If the server response indicates Success (HTTP 2xx Status Code),
> the robot must read the content, parse it, and follow any
> instructions applicable to that robot.
>
> If the server response indicates the resource does not exist (HTTP
> Status Code 404), the robot can assume no instructions are
> available, and that access to the site is unrestricted.
>
> Specific behaviours for other server responses are not required by
> this specification, though the following behaviours are recommended:
>
> - On server response indicating access restrictions (HTTP Status
> Code 401 or 403) a robot should regard access to the site as
> completely restricted.
>
> - If the request attempt results in a temporary failure, a robot
> should defer visits to the site until such time as the resource
> can be retrieved.
>
> - On server response indicating Redirection (HTTP Status Code 3XX)
> a robot should follow the redirects until a resource can be
> found.
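>
> As a purely illustrative, non-normative example, the response
> handling above might be implemented along the following lines; the
> function name and the choice of HTTP library are arbitrary, and the
> error handling is deliberately minimal:
>
>   import urllib.error
>   import urllib.request
>
>   def fetch_robots_txt(site):
>       """Fetch http://<site>/robots.txt and map the server response
>       onto the behaviour described above; returns the rule text to
>       be parsed (see section 3.2)."""
>       url = "http://%s/robots.txt" % site
>       try:
>           # urlopen follows 3xx redirects until a resource is found
>           with urllib.request.urlopen(url) as resp:
>               # 2xx: read the content, parse it, follow it
>               return resp.read().decode("latin-1", "replace")
>       except urllib.error.HTTPError as err:
>           if err.code == 404:            # absent: access unrestricted
>               return ""
>           if err.code in (401, 403):     # restricted: avoid the site
>               return "User-agent: *\nDisallow: /\n"
>           raise                          # e.g. 5xx: defer the visit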
>
>
>3.2 File Format Description
>
> The instructions are encoded as a formatted plain text object,
> described here. A complete BNF-like description of the syntax of this
> format is given in section 3.3.
>
> The format logically consists of a non-empty set of records,
> separated by blank lines. Each record consists of a set of lines of
> the form:
>
> <Field> ":" <value>
>
> In this memo we refer to lines with a Field "foo" as "foo lines".
>
> The record starts with one or more User-agent lines, specifying
> which robots the record applies to, followed by "Disallow" and
> "Allow" instructions for those robots. For example:
>
> User-agent: webcrawler
> User-agent: infoseek
> Allow: /tmp/ok.html
> Disallow: /tmp
> Disallow: /user/foo
>
> These lines are discussed separately below.
>
> Comments are allowed anywhere in the file, and consist of a comment
> character '#' followed by the comment, terminated by the end-of-line.
>
>3.2.1 The User-agent line
>
> The User-agent line indicates to which specific robots the record
> applies.
>
> The line either specifies a simple name for a robot, or "*",
> indicating this record is the default record for robots for which
> no explicit User-agent line can be found in any of the records.
>
> The name(s) a robot scans for need to be simple, obvious and well
> documented. Robots should use the same name in the User-agent field
> of an HTTP request, minus version information. Note that the syntax
> for the token in the "/robots.txt" file is more restrictive than the
> product token syntax for the HTTP User-agent field.
>
> The name comparisons are case-insensitive.
>
> For example, a fictional company FigTree Search Services, which
> names its robot "Fig Tree" and sends HTTP requests like:
>
> GET / HTTP/1.0
> User-agent: FigTree/0.1 Robot/1.0 libwww-perl/5.04
>
> might scan the "/robots.txt" file for records with:
>
> User-agent: figtree
>
> Where possible, robots should specify the name(s) they scan for in
> included documentation.
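>
> The following non-normative fragment illustrates how a robot might
> select the record that applies to it; the record representation (a
> list of (agent_names, rules) pairs) and the function name are only
> examples, not part of this specification:
>
>   def select_record(records, robot_name):
>       """records: list of (agent_names, rules) pairs, where rules is
>       an ordered list of ("Allow"|"Disallow", path) pairs. Returns
>       the rules applying to robot_name, or None if no record (not
>       even a "*" default record) applies."""
>       default_rules = None
>       wanted = robot_name.lower()
>       for agent_names, rules in records:
>           names = [n.lower() for n in agent_names]
>           if wanted in names:        # name comparison is case-insensitive
>               return rules           # an explicit match beats the default
>           if "*" in names and default_rules is None:
>               default_rules = rules  # remember the first default record
>       return default_rules
>
> For the example record above, select_record(records, "WebCrawler")
> and select_record(records, "webcrawler") return the same rules.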
>
>3.2.2 The Allow and Disallow lines
>
> These lines indicate whether accessing a URL that matches the
> corresponding path is allowed or disallowed. Note that these
> instructions apply to any HTTP method on a URL.
>
> To evaluate if a URL is allowed, a robot must attempt to match
> the paths in Allow and Disallow lines against the URL, in the order
> they occur in the record. The first match found is used.
>
> The matching process compares every octet in the path portion of
> the URL and the path from the record. If a %xx encoded octet is
> encountered it is unencoded prior to comparison, unless it is the
> "/" character, which has special meaning in a path. The match
> evaluates positively if and only if the end of the path from the
> record is reached before a difference in octets is encountered.
>
> This table illustrates some examples:
>
> Record Path          URL path            Matches
> /tmp                 /tmp                yes
> /tmp                 /tmp.html           yes
> /tmp                 /tmp/a.html         yes
> /tmp/                /tmp                no
> /tmp/                /tmp/               yes
> /tmp/                /tmp/a.html         yes
> /a%3cd.html          /a%3cd.html         yes
> /a%3Cd.html          /a%3cd.html         yes
> /a%3cd.html          /a%3Cd.html         yes
> /a%3cd.html          /a%3cd.html         yes
> /a%2fb.html          /a%2fb.html         yes
> /a%2fb.html          /a%2Fb.html         yes
> /a%2fb.html          /a/b.html           no
> /%7ejoe/index.html   /~joe/index.html    yes
> /~joe/index.html     /%7Ejoe/index.html  yes
>
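> A non-normative sketch of this matching process follows. It decodes
> %xx octets except the encoded "/", then does an octet-wise prefix
> comparison; the names are illustrative only:
>
>   HEX = set("0123456789abcdefABCDEF")
>
>   def decode_except_slash(path):
>       """Decode %xx escapes, but keep an encoded "/" ("%2F") encoded,
>       since it is not the same as a path separator; it is normalised
>       to upper case so that %2f and %2F still compare equal."""
>       out, i = "", 0
>       while i < len(path):
>           if (path[i] == "%" and i + 2 < len(path)
>                   and path[i+1] in HEX and path[i+2] in HEX):
>               if path[i+1:i+3].lower() == "2f":
>                   out += "%2F"
>               else:
>                   out += chr(int(path[i+1:i+3], 16))
>               i += 3
>           else:
>               out += path[i]
>               i += 1
>       return out
>
>   def path_matches(record_path, url_path):
>       """True iff the record path, compared octet by octet, is a
>       prefix of the URL path (the end of the record path is reached
>       before any difference is encountered)."""
>       return decode_except_slash(url_path).startswith(
>           decode_except_slash(record_path))
>
>   path_matches("/a%3cd.html", "/a%3Cd.html")   # True, as in the table
>   path_matches("/a%2fb.html", "/a/b.html")     # False
>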
>3.3 File Format Syntax
>
> This is a BNF-like description, using the conventions of RFC 822 [5],
> except that "|" is used to designate alternatives. Briefly, literals
> are quoted with "", parentheses "(" and ")" are used to group
> elements, optional elements are enclosed in [brackets], and elements
> may be preceded with <n>* to designate n or more repetitions of the
> following element; n defaults to 0.
>
> [XXX ought to feed through flex/bison to check rules; any takers?]
>
> robotstxt = *blankcomment
> | *blankcomment record *( 1*commentblank 1*record )
> *blankcomment
> blankcomment = 1*(blank | commentline)
> commentblank = *commentline blank *(blankcomment)
> blank = *space CRLF
> CRLF = CR LF
> record = *commentline agentline *(commentline | agentline)
> 1*ruleline *(commentline | ruleline)
> agentline = "User-agent:" *space agent *space [comment] CRLF
> ruleline = (disallowline | allowline)
> disallowline = "Disallow:" *space path *space [comment] CRLF
> allowline = "Allow:" *space rpath *space [comment] CRLF
> commentline = comment CRLF
> comment = "#" anychar
> space = 1*(SP | HT)
> rpath = "/" path
> agent = token
> anychar = <any CHAR except CR or LF>
> CHAR = <any US-ASCII character (octets 0 - 127)>
> CR = <US-ASCII CR, carriage return (13)>
> LF = <US-ASCII LF, linefeed (10)>
> SP = <US-ASCII SP, space (32)>
> HT = <US-ASCII HT, horizontal-tab (9)>
>
> The syntax for "path" is defined in RFC 1808 [6], reproduced here
> for convenience:
>
> path = fsegment *( "/" segment )
> fsegment = 1*pchar
> segment = *pchar
>
> pchar = uchar | ":" | "@" | "&" | "="
> uchar = unreserved | escape
> unreserved = alpha | digit | safe | extra
>
> escape = "%" hex hex
> hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
> "a" | "b" | "c" | "d" | "e" | "f"
>
> alpha = lowalpha | hialpha
> lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
> "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
> "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
> hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
> "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
> "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
>
> digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> "8" | "9"
>
> safe = "$" | "-" | "_" | "." | "+"
> extra = "!" | "*" | "'" | "(" | ")" | ","
>
> The syntax for "token" is taken from RFC 1945 [2], reproduced here
> for convenience:
>
> token = 1*<any CHAR except CTLs or tspecials>
>
> tspecials = "(" | ")" | "<" | ">" | "@"
> | "," | ";" | ":" | "\" | <">
> | "/" | "[" | "]" | "?" | "="
> | "{" | "}" | SP | HT
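>
> For illustration only, a tolerant line-based parser for this format
> might look like the following sketch; it produces the (agent_names,
> rules) record structure used in the earlier examples, and a strict
> parser would follow the grammar above exactly:
>
>   def parse_records(text):
>       """Return a list of (agent_names, rules) pairs; rules is an
>       ordered list of ("Allow"|"Disallow", path) pairs."""
>       records, agents, rules = [], [], []
>       # be liberal about end-of-line conventions (see section 4.2)
>       for line in text.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
>           line = line.split("#", 1)[0].strip()   # drop comments, whitespace
>           if not line:
>               if agents:                         # a blank line ends a record
>                   records.append((agents, rules))
>                   agents, rules = [], []
>               continue
>           field, _, value = line.partition(":")
>           field, value = field.strip().lower(), value.strip()
>           if field == "user-agent":
>               if rules:                          # tolerate a missing blank line
>                   records.append((agents, rules))
>                   agents, rules = [], []
>               agents.append(value)
>           elif field in ("allow", "disallow"):
>               rules.append((field.capitalize(), value))
>       if agents:
>           records.append((agents, rules))
>       return records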
>
>3.4 Expiration
>
> Robots should cache /robots.txt files, but if they do they must
> periodically verify the cached copy is fresh before using its
> contents.
>
> Standard HTTP cache-control mechanisms can be used by both origin
> server and robots to influence the caching of the /robots.txt file.
> Specifically, robots should take note of the Expires header set by the
> origin server.
>
> If no cache-control directives are present robots should default to
> an expiry of 7 days.
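>
> As a non-normative illustration of this expiry rule, using the
> Expires header when present and the 7 day default otherwise (the
> function name and cache record shape are only examples):
>
>   from datetime import datetime, timedelta, timezone
>   from email.utils import parsedate_to_datetime
>
>   DEFAULT_TTL = timedelta(days=7)   # default when no directives present
>
>   def cache_is_fresh(fetched_at, expires_header=None, now=None):
>       """fetched_at: when the cached copy was retrieved (an aware
>       datetime); expires_header: the Expires value, if one was sent."""
>       now = now or datetime.now(timezone.utc)
>       if expires_header:
>           return now < parsedate_to_datetime(expires_header)
>       return now < fetched_at + DEFAULT_TTL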
>
>
>4. Notes for Implementors
>
>4.1 Backwards Compatibility
>
> Previous versions of this specification didn't provide the Allow
> line. The introduction of the Allow line causes robots to behave
> slightly differently under either specification:
>
> If a /robots.txt file contains an Allow which overrides a later
> occurring Disallow, a robot ignoring Allow lines will not retrieve
> those parts. This is considered acceptable because there is no
> requirement for a robot to access URLs it is allowed to retrieve,
> and it is safe, in that no URLs a Web site administrator wants to
> Disallow end up being allowed. It is expected this may in fact
> encourage robots to upgrade to compliance with the specification in
> this memo.
>
>4.2 Interoperability
>
> Implementors should pay particular attention to robustness in
> parsing of the /robots.txt file. Web site administrators who are not
> aware of the /robots.txt mechanism often notice repeated failing
> requests for it in their log files, and react by putting up pages
> asking "What are you looking for?".
>
> As the majority of /robots.txt files are created with platform-
> specific text editors, robots should be liberal in accepting files
> with different end-of-line conventions, specifically CR and LF in
> addition to CRLF.
>
>
>5. Security Considerations
>
> There are a few risks in the method described here, which may affect
> either origin server or robot.
>
> Web site administrators must realise this method is voluntary, and
> is not sufficient to guarantee that some robots will not visit
> restricted parts of the URL space. Failure to use proper
> authentication or other restrictions may result in exposure of
> restricted information. It is even possible that the occurrence of
> paths in the /robots.txt file may expose the existence of resources
> not otherwise linked to on the site, which may aid people guessing
> at URLs.
>
> Robots need to be aware that the amount of resources spent on dealing
> with the /robots.txt is a function of the file contents, which is not
> under the control of the robot. To prevent denial-of-service attacks,
> robots are therefore encouraged to place limits on the resources
> spent on processing of /robots.txt.
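>
> For example, a robot might simply cap what it is willing to read and
> parse; the limits below are arbitrary illustrations, not recommended
> values:
>
>   MAX_BYTES = 64 * 1024   # arbitrary example limit
>   MAX_LINES = 1000        # arbitrary example limit
>
>   def read_limited(resp):
>       """Read at most MAX_BYTES of the /robots.txt body and keep at
>       most MAX_LINES lines, so a hostile or broken file cannot
>       exhaust the robot's resources."""
>       body = resp.read(MAX_BYTES).decode("latin-1", "replace")
>       return "\n".join(body.splitlines()[:MAX_LINES])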
>
>
>6. Acknowledgements
>
> The author would like to thank the subscribers to the robots mailing
> list for their contributions to this specification.
>
>
>7. References
>
> [1] Koster, M., "A Standard for Robot Exclusion",
> http://info.webcrawler.com/mak/projects/robots/norobots.html,
> June 1994.
>
> [2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
> Transfer Protocol -- HTTP/1.0." RFC 1945, MIT/LCS, May 1996.
>
> [3] Postel, J., "Media Type Registration Procedure." RFC 1590,
> USC/ISI, March 1994.
>
> [4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
> Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
> University of Minnesota, December 1994.
>
> [5] Crocker, D., "Standard for the Format of ARPA Internet Text
> Messages", STD 11, RFC 822, UDEL, August 1982.
>
> [6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
> UC Irvine, June 1995.
>
>8. Author's Address
>
> Martijn Koster
> WebCrawler
> America Online
> 690 Fifth Street
> San Francisco
> CA 94107
>
> Phone: 415-3565431
> EMail: m.koster@webcrawler.com
>
>-- Martijn
>
>Email: m.koster@webcrawler.com
>WWW: http://info.webcrawler.com/mak/mak.html
>
>
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html