RE: RFC, draft 1

Keiji Kanazawa (keijik@microsoft.com)
Sat, 16 Nov 1996 12:05:08 -0800


Just to be clear, I am talking about the case where there was a matching
user-agent string.

>----------
>From: Keiji Kanazawa
>Sent: Saturday, November 16, 1996 9:50 AM
>To: 'robots@webcrawler.com'; 'm.koster@webcrawler.com'
>Subject: RE: RFC, draft 1
>
>Unless I'm mistaken, this proposal is incomplete: it does not state the
>behavior when there is no match. As written, it gives the impression,
>if anything, that no match implies no access.
>
>I suggest that the intention was that no match implies that the agent is
>allowed.
>
>Otherwise, a terminal "Allow: /" is needed to allow crawling of the
>whole site (or the rest of the site, I should say) by default. This is
>likely to cause a lot of grief, since it would be all too easy to forget
>to add that terminal "Allow: /", especially since the implicit default
>is currently just that.
>
>You could argue that disallowing by default is good manners, but you
>can easily imagine the opposite: a site starts complaining to you that
>its content doesn't come up in searches, and the reason is that they
>forgot to put the final "Allow: /" in the access list. After the 20th
>site where you manually check their robots.txt and write back that they
>forgot the terminal "Allow: /", you will definitely wish that the
>default had been to allow all.
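>
>To make the two readings concrete, here is a minimal Python sketch of
>rule evaluation under each default; the helper name and the "default"
>flag are mine, not the draft's, and the prefix test is a simplification
>of the draft's octet-wise matching:
>
>    def is_allowed(rules, url_path, default=True):
>        # rules: ordered (field, path) pairs from the record whose
>        # User-agent line matched this robot.
>        for field, path in rules:
>            if url_path.startswith(path):
>                return field == "Allow"
>        # No rule matched: the draft should state which default applies.
>        return default
>
>    rules = [("Disallow", "/tmp")]
>    is_allowed(rules, "/index.html")                  # True under default-allow
>    is_allowed(rules, "/index.html", default=False)   # False without "Allow: /"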
>
>>----------
>>From: m.koster@webcrawler.com[SMTP:m.koster@webcrawler.com]
>>Sent: Friday, November 15, 1996 5:02 PM
>>To: robots@webcrawler.com
>>Subject: RFC, draft 1
>>
>>
>>Hallvard wrote recently:
>>
>>> We need official WWW standards to refer to the Robot Exclusion Standard
>>
>>I finally sat down and wrote a new specification of the Standard for
>>Robot Exclusion (at the expense of reading the list :-)
>>My focus is not on new features (although I did add Allow),
>>but on providing a more solid specification which addresses concerns of
>>ambiguity and completeness. As a result the spec itself looks complex,
>>though the format itself isn't; this is _not_ a user guide, which would
>>look completely different, and far simpler.
>>
>>FWIW, it also conforms to RFC format, so that it can be submitted to the
>>RFC editor once we're happy with it; though we're not a WG, I would like
>>to see some rough consensus first.
>>
>>In other words, this is more what I _should_ have written in 1994.
>>
>>I would really appreciate constructive criticism on this document.
>>After two days of writing I'm probably glazing over...
>>It is both appended and available at
>>http://info.webcrawler.com/mak/projects/robots/robots.html.
>>
>>Incidentally, I do expect this will make introduction of new features
>>much easier too, as diffs will make it a lot easier to spot potential
>>problems. (Like, anyone noticed '*' is legal in a URL path? :-)
>>
>>Regards,
>>
>>-- Martijn Koster
>>
>>Draft version 1, Fri Nov 15 14:46:55 PST 1996
>>
>>
>>
>>
>>
>>Network Working Group                                         M. Koster
>>Request for Comments: NNNN                                    WebCrawler
>>Category: Informational                                  November 1996
>>
>>
>>
>> A Method for Robots Exclusion
>>
>>Status of this Memo
>>
>> This memo provides information for the Internet community. This memo
>> does not specify an Internet standard of any kind. Distribution of
>> this memo is unlimited.
>>
>>Table of Contents
>>
>> 1. Abstract . . . . . . . . . . . . . . . . . . . . . . . . . 1
>> 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . 1
>> 3. Specification . . . . . . . . . . . . . . . . . . . . . . . 1
>> 3.1 Access method . . . . . . . . . . . . . . . . . . . . . . . 1
>> 3.2 File Format Description . . . . . . . . . . . . . . . . . . 1
>> 3.2.1 The User-agent line . . . . . . . . . . . . . . . . . . . . 1
>> 3.2.2 The Allow and Disallow lines . . . . . . . . . . . . . . . 1
>> 3.3 File Format Syntax . . . . . . . . . . . . . . . . . . . . 1
>> 3.4 Expiration . . . . . . . . . . . . . . . . . . . . . . . . 1
>> 4. Implementor's Notes . . . . . . . . . . . . . . . . . . . . 1
>> 4.1 Backwards Compatibility . . . . . . . . . . . . . . . . . . 1
>> 4.2 Interoperability . . . . . . . . . . . . . . . . . . . . . . . 1
>> 5. Security Considerations . . . . . . . . . . . . . . . . . . 1
>> 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . 1
>> 7. References . . . . . . . . . . . . . . . . . . . . . . . 1
>> 8. Author's Address . . . . . . . . . . . . . . . . . . . . . 1
>>
>> [XXX once completed, break into pages and update page-numbers in table
>> of contents]
>>
>>1. Abstract
>>
>> This memo defines a method for administrators of sites on the World-
>> Wide Web to give instructions to visiting Web robots, most
>> importantly what areas of the site are to be avoided.
>>
>> This document provides a more rigid specification of the Standard
>> for Robots Exclusion [1], which has been in widespread use by
>> the Web community since 1994. The specification here is intended to
>> replace that standard, while aiming to be backwards compatible.
>>
>>
>>2. Introduction
>>
>> Web Robots (also called "Wanderers" or "Spiders") are Web client
>> programs that automatically traverse the Web's hypertext structure
>> by retrieving a document, and recursively retrieving all documents
>> that are referenced.
>>
>> Note that "recursively" here doesn't limit the definition to any
>> specific traversal algorithm; even if a robot applies some heuristic
>> to the selection and order of documents to visit and spaces out
>> requests over a long space of time, it qualifies to be called a
>> robot.
>>
>> Robots are often used for maintenance and indexing purposes, by
>> people other than the administrators of the site being visited. In
>> some cases such visits may have undesirable effects which the
>> administrators would like to prevent, such as indexing of an
>> unannounced site, traversal of parts of the site which require vast
>> resources of the server, recursive traversal of an infinite URL
>> space, etc.
>>
>> The technique specified in this memo allows Web site administrators
>> to indicate to visiting robots which parts of the site should be
>> avoided. It is solely up to the visiting robot to consult this
>> information and act accordingly. Blocking parts of the Web site
>> regardless of a robot's compliance with this method is outside
>> the scope of this memo.
>>
>>
>>3. The Specification
>>
>> This memo specifies a format for encoding instructions to visiting
>> robots, and specifies an access method to retrieve these
>> instructions. Robots must retrieve these instructions before visiting
>> other URLs on the site, and use the instructions to determine if
>> other URLs on the site can be accessed.
>>
>>3.1 Access method
>>
>> The instructions must be accessible via HTTP [2] from the site that
>> the instructions are to be applied to, as a resource of Internet
>> Media Type [3] "text/plain" under a standard relative path on the
>> server: "/robots.txt".
>>
>> For convenience we will refer to this resource as the "/robots.txt
>> file", though the resource need in fact not originate from a file-
>> system.
>>
>> Some examples of URLs [4] for sites and URLs for the corresponding
>> "/robots.txt" files:
>>
>> http://www.foo.com/welcome.html http://www.foo.com/robots.txt
>>
>> http://www.bar.com:8001/ http://www.bar.com:8001/robots.txt
>>
>> If the server response indicates Success (HTTP 2xx Status Code),
>> the robot must read the content, parse it, and follow any
>> instructions applicable to that robot.
>>
>> If the server response indicates the resource does not exist (HTTP
>> Status Code 404), the robot can assume no instructions are
>> available, and that access to the site is unrestricted.
>>
>> Specific behaviors for other server responses are not required by
>> this specification, though the following behaviours are recommended:
>>
>> - On a server response indicating access restrictions (HTTP Status
>> Code 401 or 403) a robot should regard access to the site as
>> completely restricted.
>>
>> - If the request attempt results in temporary failure, a robot
>> should defer visits to the site until such time as the resource
>> can be retrieved.
>>
>> - On server response indicating Redirection (HTTP Status Code 3XX)
>> a robot should follow the redirects until a resource can be
>> found.
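>>
>> As a rough illustration only (not part of this specification), the
>> retrieval step might look like the following Python sketch; the
>> policy labels ("parse", "allow-all", and so on) are this sketch's
>> own naming:
>>
>>   import http.client
>>
>>   def fetch_robots_policy(host, port=80):
>>       conn = http.client.HTTPConnection(host, port)
>>       conn.request("GET", "/robots.txt")
>>       resp = conn.getresponse()
>>       body = resp.read()
>>       if 200 <= resp.status < 300:       # Success: parse and obey
>>           return ("parse", body)
>>       if resp.status == 404:             # Absent: access is unrestricted
>>           return ("allow-all", None)
>>       if resp.status in (401, 403):      # Restricted: treat site as off-limits
>>           return ("disallow-all", None)
>>       if 300 <= resp.status < 400:       # Redirection: follow Location, retry
>>           return ("redirect", resp.getheader("Location"))
>>       return ("retry-later", None)       # Temporary failure: defer visits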
>>
>>
>>3.2 File Format Description
>>
>> The instructions are encoded as a formatted plain text object,
>> described here. A complete BNF-like description of the syntax of this
>> format is given in section 3.3.
>>
>> The format logically consists of a non-empty set of records,
>> separated by blank lines. The records consist of a set of lines of
>> the form:
>>
>> <Field> ":" <value>
>>
>> In this memo we refer to lines with a Field "foo" as "foo lines".
>>
>> The record starts with one or more User-agent lines, specifying
>> which robots the record applies to, followed by "Disallow" and
>> "Allow" instructions to that robot. For example:
>>
>> User-agent: webcrawler
>> User-agent: infoseek
>> Allow: /tmp/ok.html
>> Disallow: /tmp
>> Disallow: /user/foo
>>
>> These lines are discussed separately below.
>>
>> Comments are allowed anywhere in the file, and consist of a comment
>> character '#' followed by the comment, terminated by the end-of-line.
>>
>>3.2.1 The User-agent line
>>
>> The User-agent line indicates to which specific robots the record
>> applies.
>>
>> The line either specifies a simple name for a robot, or "*",
>> indicating this record is the default record for robots for which
>> no explicit User-agent line can be found in any of the records.
>>
>> The name(s) a robot scans for need to be simple, obvious, and well
>> documented. Robots should use the same name in the User-agent
>> field of an HTTP request, minus version information. Note
>> that the syntax for the token in the "/robots.txt" file is more
>> restrictive than the product token syntax for the HTTP User-agent
>> field.
>>
>> The name comparisons are case-insensitive.
>>
>> For example, a fictional company FigTree Search Services, which names
>> its robot "Fig Tree" and sends HTTP requests like:
>>
>> GET / HTTP/1.0
>> User-agent: FigTree/0.1 Robot/1.0 libwww-perl/5.04
>>
>> might scan the "/robots.txt" file for records with:
>>
>> User-agent: figtree
>>
>> Where possible, robots should specify the name(s) they scan for in
>> included documentation.
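>>
>> As an informal sketch (not part of this specification), selecting the
>> applicable record could look like this in Python; the record structure,
>> a list of (agent tokens, rules) pairs, is an assumption of the sketch
>> rather than something mandated by the format:
>>
>>   def select_record(records, robot_name):
>>       # robot_name is the robot's simple name, e.g. "figtree",
>>       # with version information already stripped.
>>       default = None
>>       for agents, rules in records:
>>           for agent in agents:
>>               if agent == "*" and default is None:
>>                   default = rules               # remember the default record
>>               elif agent.lower() == robot_name.lower():
>>                   return rules                  # explicit match wins
>>       return default                            # None: no record applies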
>>
>>3.2.2 The Allow and Disallow lines
>>
>> These lines indicate whether accessing a URL that matches the
>> corresponding path is allowed or disallowed. Note that these
>> instructions apply to any HTTP method on a URL.
>>
>> To evaluate if a URL is allowed, a robot must attempt to match
>> the paths in Allow and Disallow lines against the URL, in the order
>> they occur in the record. The first match found is used.
>>
>> The matching process compares every octet in the path portion of
>> the URL and the path from the record. If a %xx encoded octet is
>> encountered it is unencoded prior to comparison, unless it is the
>> "/" character, which has special meaning in a path. The match
>> evaluates positively if and only if the end of the path from the
>> record is reached before a difference in octets is encountered.
>>
>> This table illustrates some examples:
>>
>> Record Path         URL path            Matches
>> /tmp                /tmp                yes
>> /tmp                /tmp.html           yes
>> /tmp                /tmp/a.html         yes
>> /tmp/               /tmp                no
>> /tmp/               /tmp/               yes
>> /tmp/               /tmp/a.html         yes
>> /a%3cd.html         /a%3cd.html         yes
>> /a%3Cd.html         /a%3cd.html         yes
>> /a%3cd.html         /a%3Cd.html         yes
>> /a%2fb.html         /a%2fb.html         yes
>> /a%2fb.html         /a%2Fb.html         yes
>> /a%2fb.html         /a/b.html           no
>> /%7ejoe/index.html  /~joe/index.html    yes
>> /~joe/index.html    /%7Ejoe/index.html  yes
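>>
>> A minimal Python sketch of this octet-wise comparison follows; it is
>> not normative, and decode_octets is a helper invented here. "%2F" is
>> left encoded (but case-normalised) so that an encoded "/" never
>> matches a literal path separator:
>>
>>   import re
>>
>>   def decode_octets(path):
>>       # Undo %xx encoding, except for "%2F" (an encoded "/").
>>       def repl(m):
>>           if m.group(0).lower() == "%2f":
>>               return "%2F"
>>           return chr(int(m.group(1), 16))
>>       return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)
>>
>>   def path_matches(record_path, url_path):
>>       # The match succeeds iff the record path is exhausted before
>>       # any octet differs, i.e. it is a prefix after decoding.
>>       return decode_octets(url_path).startswith(decode_octets(record_path))
>>
>>   path_matches("/tmp", "/tmp/a.html")                      # yes
>>   path_matches("/tmp/", "/tmp")                            # no
>>   path_matches("/%7ejoe/index.html", "/~joe/index.html")   # yes
>>   path_matches("/a%2fb.html", "/a/b.html")                 # no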
>>
>>3.3 File Format Syntax
>>
>> This is a BNF-like description, using the conventions of RFC 822 [5],
>> except that "|" is used to designate alternatives. Briefly, literals
>> are quoted with "", parentheses "(" and ")" are used to group
>> elements, optional elements are enclosed in [brackets], and elements
>> may be preceded with <n>* to designate n or more repetitions of the
>> following element; n defaults to 0.
>>
>> [XXX ought to feed through flex/bison to check rules; any takers?]
>>
>> robotstxt = *blankcomment
>> | *blankcomment record *( 1*commentblank 1*record )
>> *blankcomment
>> blankcomment = 1*(blank | commentline)
>> commentblank = *commentline blank *(blankcomment)
>> blank = *space CRLF
>> CRLF = CR LF
>> record = *commentline agentline *(commentline | agentline)
>> 1*ruleline *(commentline | ruleline)
>> agentline = "User-agent:" *space agent *space [comment] CRLF
>> ruleline = (disallowline | allowline)
>> disallowline = "Disallow:" *space path *space [comment] CRLF
>> allowline = "Allow:" *space rpath *space [comment] CRLF
>> commentline = comment CRLF
>> comment = "#" anychar
>> space = 1*(SP | HT)
>> rpath = "/" path
>> agent = token
>> anychar = <any CHAR except CR or LF>
>> CHAR = <any US-ASCII character (octets 0 - 127)>
>> CR = <US-ASCII CR, carriage return (13)>
>> LF = <US-ASCII LF, linefeed (10)>
>> SP = <US-ASCII SP, space (32)>
>> HT = <US-ASCII HT, horizontal-tab (9)>
>>
>> The syntax for "path" is defined in RFC 1808, reproduced here for
>> convenience:
>>
>> path = fsegment *( "/" segment )
>> fsegment = 1*pchar
>> segment = *pchar
>>
>> pchar = uchar | ":" | "@" | "&" | "="
>> uchar = unreserved | escape
>> unreserved = alpha | digit | safe | extra
>>
>> escape = "%" hex hex
>> hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
>> "a" | "b" | "c" | "d" | "e" | "f"
>>
>> alpha = lowalpha | hialpha
>> lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
>> "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
>> "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
>> hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
>> "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
>> "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
>>
>> digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
>> "8" | "9"
>>
>> safe = "$" | "-" | "_" | "." | "+"
>> extra = "!" | "*" | "'" | "(" | ")" | ","
>>
>> The syntax for "token" is taken from RFC 1945, reproduced here for
>> convenience:
>>
>> token = 1*<any CHAR except CTLs or tspecials>
>>
>> tspecials = "(" | ")" | "<" | ">" | "@"
>> | "," | ";" | ":" | "\" | <">
>> | "/" | "[" | "]" | "?" | "="
>> | "{" | "}" | SP | HT
>>
>>3.4 Expiration
>>
>> Robots should cache /robots.txt files, but if they do they must
>> periodically verify the cached copy is fresh before using its
>> contents.
>>
>> Standard HTTP cache-control mechanisms can be used by both origin
>> server and robots to influence the caching of the /robots.txt file.
>> Specifically, robots should take note of the Expires header set by
>> the origin server.
>>
>> If no cache-control directives are present, robots should default to
>> an expiry of 7 days.
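>>
>> A small Python sketch of that policy (the function name is invented
>> here; the 7-day figure is the default described above):
>>
>>   import email.utils, time
>>
>>   def robots_expiry(expires_header, fetched_at=None):
>>       # Honour an Expires header from the origin server if present,
>>       # otherwise fall back to the 7-day default.
>>       if fetched_at is None:
>>           fetched_at = time.time()
>>       if expires_header:
>>           try:
>>               when = email.utils.parsedate_to_datetime(expires_header)
>>               return when.timestamp()
>>           except (TypeError, ValueError):
>>               pass                               # unparsable date: use default
>>       return fetched_at + 7 * 24 * 3600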
>>
>>
>>4. Notes for Implementors
>>
>>4.1 Backwards Compatibility
>>
>> Previous versions of this specification did not provide the Allow
>> line. The introduction of the Allow line causes robots to behave
>> slightly differently under either specification:
>>
>> If a /robots.txt file contains an Allow which overrides a later
>> occurring Disallow, a robot ignoring Allow lines will not retrieve
>> those parts. This is considered acceptable because there is no
>> requirement for a robot to access URLs it is allowed to retrieve,
>> and it is safe, in that no URLs a Web site administrator wants to
>> Disallow will be allowed. It is expected this may in fact encourage
>> robots to upgrade to compliance with the specification in this memo.
>>
>>4.2 Interoperability
>>
>> Implementors should pay particular attention to robustness when
>> parsing the /robots.txt file. Web site administrators who are not
>> aware of the /robots.txt mechanism often notice repeated failing
>> requests for it in their log files, and react by putting up pages
>> asking "What are you looking for?".
>>
>> As the majority of /robots.txt files are created with platform-
>> specific text editors, robots should be liberal in accepting files
>> with different end-of-line conventions, specifically CR and LF in
>> addition to CRLF.
>>
>>
>>5. Security Considerations
>>
>> There are a few risks in the method described here, which may affect
>> either origin server or robot.
>>
>> Web site administrators must realise this method is voluntary, and
>> is not sufficient to guarantee some robots will not visit restricted
>> parts of the URL space. Failure to use proper authentication or other
>> restriction may result in exposure of restricted information. It is
>> even possible that the occurrence of paths in the /robots.txt file
>> may expose the existence of resources not otherwise linked to on the
>> site, which may aid people guessing for URLs.
>>
>> Robots need to be aware that the amount of resources spent on dealing
>> with the /robots.txt file is a function of the file contents, which
>> is not under the control of the robot. To prevent denial-of-service
>> attacks, robots are therefore encouraged to place limits on the
>> resources spent on processing of /robots.txt.
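>>
>> One simple way to bound that work, sketched in Python; the limits are
>> arbitrary examples rather than values recommended by this memo, and
>> parse_robotstxt refers to the sketch in section 3.3:
>>
>>   MAX_BYTES = 64 * 1024        # cap on how much of /robots.txt is read
>>   MAX_RULES = 1000             # cap on rule lines kept per record
>>
>>   def read_limited(resp):
>>       # resp is an http.client response, as in the section 3.1 sketch;
>>       # read at most MAX_BYTES and ignore the rest of the body.
>>       return resp.read(MAX_BYTES)
>>
>>   def limit_rules(records):
>>       # Keep only the first MAX_RULES rule lines of each record.
>>       return [(agents, rules[:MAX_RULES]) for agents, rules in records]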
>>
>>
>>6. Acknowledgements
>>
>> The author would like to thank the subscribers to the robots mailing
>> list for their contributions to this specification.
>>
>>
>>7. References
>>
>> [1] Koster, M., "A Standard for Robot Exclusion",
>> http://info.webcrawler.com/mak/projects/robots/norobots.html,
>> June 1994.
>>
>> [2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
>> Transfer Protocol -- HTTP/1.0." RFC 1945, MIT/LCS, May 1996.
>>
>> [3] Postel, J., "Media Type Registration Procedure." RFC 1590,
>> USC/ISI, March 1994.
>>
>> [4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
>> Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
>> University of Minnesota, December 1994.
>>
>> [5] Crocker, D., "Standard for the Format of ARPA Internet Text
>> Messages", STD 11, RFC 822, UDEL, August 1982.
>>
>> [6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
>> UC Irvine, June 1995.
>>
>>8. Author's Address
>>
>> Martijn Koster
>> WebCrawler
>> America Online
>> 690 Fifth Street
>> San Francisco
>> CA 94107
>>
>> Phone: 415-3565431
>> EMail: m.koster@webcrawler.com
>>
>>-- Martijn
>>
>>Email: m.koster@webcrawler.com
>>WWW: http://info.webcrawler.com/mak/mak.html
>>
>>
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html