I suggest that the intention was that no match implies that the agent
is allowed.
Otherwise, a terminal "Allow: /" is needed to allow crawling of the
whole site (or the rest of the site, I should say) by default. This is
likely to cause a lot of grief, since it would be all too easy to
forget to add that terminal "Allow: /", especially since the implicit
default is currently just that.
You could argue that not allowing as a default is good manners, but you
can easily imagine the opposite: a site starts complaining to you that
their content doesn't come up in searches, and the reason is that they
forgot to put the last "Allow: /" in the access list. After the 20th
site where you manually check their robots.txt and write back that they
forgot the terminal "Allow: /", you will definitely wish that the
default had been set to allow all.
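
To make the difference concrete, here is a rough sketch (Python, names
mine, with path matching reduced to a plain prefix test) of the
first-match evaluation with "allow" as the fallback:

    def is_allowed(rules, url_path):
        # rules: ordered (allow, path) pairs taken from the record
        # that applies to this robot.
        for allow, path in rules:
            if url_path.startswith(path):
                return allow        # first match wins
        return True                 # no match: default to "allowed"

With "disallow" as the default, that last line would have to read
"return False", and every record would need a trailing "Allow: /" to
open up the rest of the site.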
>----------
>From: m.koster@webcrawler.com[SMTP:m.koster@webcrawler.com]
>Sent: Friday, November 15, 1996 5:02 PM
>To: robots@webcrawler.com
>Subject: RFC, draft 1
>
>
>Hallvard wrote recently:
>
>> We need official WWW standards to refer to the Robot Exclusion Standard
>
>I finally sat down and wrote a new specification of the Standard for
>Robot Exclusion (at the expense of reading the list :-)
>My focus is not on new features (although I did add Allow),
>but on providing a more solid specification which addresses concerns of
>ambiguity and completeness. As a result the spec itself looks complex,
>though the format itself isn't; this is _not_ a user guide, that would
>look completely different, and far simpler.
>
>FWIW, it also conforms to RFC format, so that it can be submitted to the
>RFC editor once we're happy with it; though we're not a WG, I would like
>to see some rough consensus first.
>
>In other words, this is more what I _should_ have written in 1994.
>
>I would really appreciate constructive criticism on this document.
>After two days of writing I'm probably glazing over...
>It is both appended and available on
>http://info.webcrawler.com/mak/projects/robots/robots.html.
>
>Incidentally, I do expect this will make introduction of new features
>much easier too, as diffs will make it a lot easier to spot potential
>problems. (Like, anyone noticed '*' is legal in a URL path? :-)
>
>Regards,
>
>-- Martijn Koster
>
>Draft version 1, Fri Nov 15 14:46:55 PST 1996
>
>
>
>
>
>Network Working Group                                          M. Koster
>Request for Comments: NNNN                                     WebCrawler
>Category: Informational                                     November 1996
>
>
>
> A Method for Robots Exclusion
>
>Status of this Memo
>
> This memo provides information for the Internet community. This memo
> does not specify an Internet standard of any kind. Distribution of
> this memo is unlimited.
>
>Table of Contents
>
> 1. Abstract . . . . . . . . . . . . . . . . . . . . . . . . . 1
> 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . 1
> 3. Specification . . . . . . . . . . . . . . . . . . . . . . . 1
> 3.1 Access method . . . . . . . . . . . . . . . . . . . . . . . 1
> 3.2 File Format Description . . . . . . . . . . . . . . . . . . 1
> 3.2.1 The User-agent line . . . . . . . . . . . . . . . . . . . . 1
> 3.2.2 The Allow and Disallow lines . . . . . . . . . . . . . . . 1
> 3.3 File Format Syntax . . . . . . . . . . . . . . . . . . . . 1
> 3.4 Expiration . . . . . . . . . . . . . . . . . . . . . . . . 1
> 4. Implementor's Notes . . . . . . . . . . . . . . . . . . . . 1
> 4.1 Backwards Compatibility . . . . . . . . . . . . . . . . . . 1
>   4.2 Interoperability  . . . . . . . . . . . . . . . . . . . 1
>   5. Security Considerations . . . . . . . . . . . . . . . . . 1
>   6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . 1
>   7. References . . . . . . . . . . . . . . . . . . . . . . . . 1
> 8. Author's Address . . . . . . . . . . . . . . . . . . . . . 1
>
> [XXX once completed, break into pages and update page-numbers in table
> of contents]
>
>1. Abstract
>
> This memo defines a method for administrators of sites on the World-
> Wide Web to give instructions to visiting Web robots, most
> importantly what areas of the site are to be avoided.
>
>   This document provides a more rigid specification of the Standard
>   for Robots Exclusion [1], which has been in widespread use by the
>   Web community since 1994. The specification here is intended to
>   replace said standard, while aiming to be backwards compatible.
>
>
>2. Introduction
>
> Web Robots (also called "Wanderers" or "Spiders") are Web client
> programs that automatically traverse the Web's hypertext structure
> by retrieving a document, and recursively retrieving all documents
> that are referenced.
>
>   Note that "recursively" here doesn't limit the definition to any
>   specific traversal algorithm; even if a robot applies some heuristic
>   to the selection and order of documents to visit and spaces out
>   requests over a long period of time, it still qualifies as a robot.
>
> Robots are often used for maintenance and indexing purposes, by
> people other than the administrators of the site being visited. In
> some cases such visits may have undesirable effects which the
> administrators would like to prevent, such as indexing of an
> unannounced site, traversal of parts of the site which require vast
> resources of the server, recursive traversal of an infinite URL
> space, etc.
>
> The technique specified in this memo allows Web site administrators
> to indicate to visiting robots which parts of the site should be
> avoided. It is solely up to the visiting robot to consult this
>   information and act accordingly. Blocking parts of the Web site
>   regardless of a robot's compliance with this method is outside
>   the scope of this memo.
>
>
>3. The Specification
>
> This memo specifies a format for encoding instructions to visiting
> robots, and specifies an access method to retrieve these
> instructions. Robots must retrieve these instructions before visiting
> other URLs on the site, and use the instructions to determine if
> other URLs on the site can be accessed.
>
>3.1 Access method
>
> The instructions must be accessible via HTTP [2] from the site that
> the instructions are to be applied to, as a resource of Internet
> Media Type [3] "text/plain" under a standard relative path on the
> server: "/robots.txt".
>
>   For convenience we will refer to this resource as the "/robots.txt
>   file", though the resource need not in fact originate from a
>   filesystem.
>
>   Some examples of URLs [4] for sites, and of the corresponding
>   "/robots.txt" URLs:
>
>      http://www.foo.com/welcome.html     http://www.foo.com/robots.txt
>
>      http://www.bar.com:8001/            http://www.bar.com:8001/robots.txt
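>
>   As an informal illustration (not part of the specification), the
>   "/robots.txt" URL could be derived from a site URL along these
>   lines in Python, keeping only the scheme, host and port:
>
>      from urllib.parse import urlsplit, urlunsplit
>
>      def robots_url(site_url):
>          # Replace path, query and fragment with "/robots.txt".
>          parts = urlsplit(site_url)
>          return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
>
>      robots_url("http://www.foo.com/welcome.html")
>      # -> "http://www.foo.com/robots.txt"
>      robots_url("http://www.bar.com:8001/")
>      # -> "http://www.bar.com:8001/robots.txt"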
>
>   If the server response indicates Success (HTTP 2xx Status Code),
>   the robot must read the content, parse it, and follow any
>   instructions applicable to that robot.
>
> If the server response indicates the resource does not exist (HTTP
> Status Code 404), the robot can assume no instructions are
> available, and that access to the site is unrestricted.
>
>   Specific behaviours for other server responses are not required by
>   this specification, though the following behaviours are recommended
>   (an informal sketch of the overall retrieval logic follows this
>   list):
>
>   - On server response indicating access restrictions (HTTP Status
>     Code 401 or 403) a robot should regard access to the site as
>     completely restricted.
>
>   - If the request attempt resulted in temporary failure, a robot
>     should defer visits to the site until such time as the resource
>     can be retrieved.
>
> - On server response indicating Redirection (HTTP Status Code 3XX)
> a robot should follow the redirects until a resource can be
> found.
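>
>   As an informal sketch only (the function name and the exact
>   error-to-policy mapping below are this illustration's, not the
>   specification's), the retrieval rules above could look like this
>   in Python; note that the standard urlopen call follows 3xx
>   redirects by itself:
>
>      import urllib.request, urllib.error
>
>      def fetch_robots_txt(site):
>          # site is e.g. "http://www.bar.com:8001" (scheme, host, port).
>          # Exceptions other than HTTPError (temporary failures) are
>          # left to the caller, which should defer visits to the site.
>          try:
>              with urllib.request.urlopen(site + "/robots.txt") as resp:
>                  return resp.read().decode("latin-1")   # 2xx: parse this
>          except urllib.error.HTTPError as e:
>              if e.code == 404:
>                  return ""            # no instructions: unrestricted
>              if e.code in (401, 403):
>                  # access restricted: treat the whole site as off limits
>                  return "User-agent: *\nDisallow: /\n"
>              raise                    # other codes: let the caller defer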
>
>
>3.2 File Format Description
>
> The instructions are encoded as a formatted plain text object,
> described here. A complete BNF-like description of the syntax of this
> format is given in section 3.3.
>
>   The format logically consists of a non-empty set of records,
>   separated by blank lines. The records consist of a set of lines of
>   the form:
>
> <Field> ":" <value>
>
> In this memo we refer to lines with a Field "foo" as "foo lines".
>
>   The record starts with one or more User-agent lines, specifying
>   which robots the record applies to, followed by "Disallow" and
>   "Allow" instructions for those robots. For example:
>
> User-agent: webcrawler
> User-agent: infoseek
> Allow: /tmp/ok.html
> Disallow: /tmp
> Disallow: /user/foo
>
> These lines are discussed separately below.
>
> Comments are allowed anywhere in the file, and consist of a comment
> character '#' followed by the comment, terminated by the end-of-line.
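>
>   Purely as an illustration (the helper and its names are not part of
>   the format), the record structure described above can be read with
>   a few lines of Python:
>
>      def parse_records(text):
>          # Records are separated by blank lines; each line is
>          # "<Field> : <value>", and "#" starts a comment.
>          records, current = [], []
>          for line in text.splitlines():
>              line = line.split("#", 1)[0].rstrip()
>              if not line.strip():
>                  if current:
>                      records.append(current)
>                  current = []
>                  continue
>              field, _, value = line.partition(":")
>              current.append((field.strip().lower(), value.strip()))
>          if current:
>              records.append(current)
>          return records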
>
>3.2.1 The User-agent line
>
> The User-agent line indicates to which specific robots the record
> applies.
>
> The line either specifies a simple name for a robot, or "*",
> indicating this record is the default record for robots for which
> no explicit User-agent line can be found in any of the records.
>
>   The name(s) a robot scans for need to be simple, obvious and well
>   documented. Robots should use the same name in the User-agent field
>   of an HTTP request, minus version information. Note that the syntax
>   for the token in the "/robots.txt" file is more restrictive than
>   the product token syntax for the HTTP User-agent field.
>
> The name comparisons are case-insensitive.
>
>   For example, the robot "Fig Tree" of a fictional company FigTree
>   Search Services, which sends HTTP requests like:
>
> GET / HTTP/1.0
> User-agent: FigTree/0.1 Robot/1.0 libwww-perl/5.04
>
> might scan the "/robots.txt" file for records with:
>
> User-agent: figtree
>
> Where possible, robots should specify the name(s) they scan for in
> included documentation.
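>
>   Continuing the informal Python sketch from section 3.2 (again, not
>   normative), selecting the record for a given robot name might look
>   like this, using the output of the parse_records helper above:
>
>      def select_record(records, robot_name):
>          # Case-insensitive match on the User-agent lines; fall back
>          # to the first "*" record when no name matches.
>          name = robot_name.lower()
>          default = None
>          for record in records:
>              agents = [v.lower() for f, v in record if f == "user-agent"]
>              if name in agents:
>                  return record
>              if "*" in agents and default is None:
>                  default = record
>          return default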
>
>3.2.2 The Allow and Disallow lines
>
> These lines indicate whether accessing a URL that matches the
> corresponding path is allowed or disallowed. Note that these
> instructions apply to any HTTP method on a URL.
>
> To evaluate if a URL is allowed, a robot must attempt to match
> the paths in Allow and Disallow lines against the URL, in the order
> they occur in the record. The first match found is used.
>
>   The matching process compares every octet in the path portion of
>   the URL and the path from the record. If a %xx encoded octet is
>   encountered it is decoded prior to comparison, unless it is the
>   "/" character, which has special meaning in a path. The match
>   evaluates positively if and only if the end of the path from the
>   record is reached before a difference in octets is encountered.
>
> This table illustrates some examples:
>
>      Record Path         URL path             Matches
>      /tmp                /tmp                 yes
>      /tmp                /tmp.html            yes
>      /tmp                /tmp/a.html          yes
>      /tmp/               /tmp                 no
>      /tmp/               /tmp/                yes
>      /tmp/               /tmp/a.html          yes
>      /a%3cd.html         /a%3cd.html          yes
>      /a%3Cd.html         /a%3cd.html          yes
>      /a%3cd.html         /a%3Cd.html          yes
>      /a%3Cd.html         /a%3Cd.html          yes
>      /a%2fb.html         /a%2fb.html          yes
>      /a%2fb.html         /a%2Fb.html          yes
>      /a%2fb.html         /a/b.html            no
>      /%7ejoe/index.html  /~joe/index.html     yes
>      /~joe/index.html    /%7Ejoe/index.html   yes
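>
>   As an informal illustration of the matching rule and the table
>   above (the function names, and the treatment of a record with no
>   matching line as "allowed", are this sketch's own assumptions),
>   the comparison could be written in Python as:
>
>      import re
>
>      def normalise(path):
>          # Decode %xx escapes, except "/" (%2F), which keeps its
>          # encoded form because "/" is significant in a path.
>          def decode(m):
>              octet = m.group(1)
>              return "%2F" if octet.lower() == "2f" else chr(int(octet, 16))
>          return re.sub(r"%([0-9A-Fa-f]{2})", decode, path)
>
>      def path_matches(record_path, url_path):
>          # Positive iff the end of the record path is reached before
>          # any differing octet.
>          return normalise(url_path).startswith(normalise(record_path))
>
>      def is_url_allowed(record, url_path):
>          # Allow and Disallow lines are tried in order; the first
>          # match wins, and no match is treated here as allowed.
>          for field, value in record:
>              if field in ("allow", "disallow") and path_matches(value, url_path):
>                  return field == "allow"
>          return True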
>3.3 File Format Syntax
>
> This is a BNF-like description, using the conventions of RFC 822 [5],
> except that "|" is used to designate alternatives. Briefly, literals
> are quoted with "", parentheses "(" and ")" are used to group
> elements, optional elements are enclosed in [brackets], and elements
> may be preceded with <n>* to designate n or more repetitions of the
> following element; n defaults to 0.
>
> [XXX ought to feed through flex/bison to check rules; any takers?]
>
> robotstxt = *blankcomment
> | *blankcomment record *( 1*commentblank 1*record )
> *blankcomment
> blankcomment = 1*(blank | commentline)
> commentblank = *commentline blank *(blankcomment)
> blank = *space CRLF
> CRLF = CR LF
> record = *commentline agentline *(commentline | agentline)
> 1*ruleline *(commentline | ruleline)
> agentline = "User-agent:" *space agent *space [comment] CRLF
> ruleline = (disallowline | allowline)
> disallowline = "Disallow:" *space path *space [comment] CRLF
> allowline = "Allow:" *space rpath *space [comment] CRLF
> commentline = comment CRLF
> comment = "#" anychar
> space = 1*(SP | HT)
> rpath = "/" path
> agent = token
> anychar = <any CHAR except CR or LF>
> CHAR = <any US-ASCII character (octets 0 - 127)>
> CR = <US-ASCII CR, carriage return (13)>
> LF = <US-ASCII LF, linefeed (10)>
> SP = <US-ASCII SP, space (32)>
> HT = <US-ASCII HT, horizontal-tab (9)>
>
>   The syntax for "path" is defined in RFC 1808 [6], reproduced here
>   for
> convenience:
>
> path = fsegment *( "/" segment )
> fsegment = 1*pchar
> segment = *pchar
>
> pchar = uchar | ":" | "@" | "&" | "="
> uchar = unreserved | escape
> unreserved = alpha | digit | safe | extra
>
> escape = "%" hex hex
> hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
> "a" | "b" | "c" | "d" | "e" | "f"
>
> alpha = lowalpha | hialpha
> lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
> "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
> "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
> hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
> "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
> "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
>
> digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> "8" | "9"
>
> safe = "$" | "-" | "_" | "." | "+"
> extra = "!" | "*" | "'" | "(" | ")" | ","
>
>   The syntax for "token" is taken from RFC 1945 [2], reproduced here
>   for
> convenience:
>
> token = 1*<any CHAR except CTLs or tspecials>
>
> tspecials = "(" | ")" | "<" | ">" | "@"
> | "," | ";" | ":" | "\" | <">
> | "/" | "[" | "]" | "?" | "="
> | "{" | "}" | SP | HT
>
>3.4 Expiration
>
> Robots should cache /robots.txt files, but if they do they must
> periodically verify the cached copy is fresh before using its
> contents.
>
>   Standard HTTP cache-control mechanisms can be used by both origin
>   server and robots to influence the caching of the /robots.txt file.
>   Specifically, robots should take note of the Expires header set by
>   the origin server.
>
> If no cache-control directives are present robots should default to
> an expiry of 7 days.
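>
>   A minimal sketch of such an expiry calculation (the names and
>   structure are this illustration's, not the specification's):
>
>      import email.utils
>
>      SEVEN_DAYS = 7 * 24 * 3600
>
>      def cache_expiry(fetched_at, expires_header=None):
>          # fetched_at is a Unix timestamp; expires_header is the
>          # value of the Expires header, if the origin server sent one.
>          if expires_header:
>              return email.utils.parsedate_to_datetime(expires_header).timestamp()
>          return fetched_at + SEVEN_DAYS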
>
>
>4. Notes for Implementors
>
>4.1 Backwards Compatibility
>
>   Previous versions of this specification didn't provide the Allow
>   line. The introduction of the Allow line causes robots to behave
>   slightly differently under either specification:
>
>   If a /robots.txt contains an Allow which overrides a later occurring
>   Disallow, a robot ignoring Allow lines will not retrieve those
>   parts. This is considered acceptable because there is no requirement
>   for a robot to access URLs it is allowed to retrieve, and it is
>   safe, in that no URLs a Web site administrator wants to Disallow
>   will be allowed. It is expected this may in fact encourage robots
>   to upgrade compliance to the specification in this memo.
>
>4.2 Interoperability
>
>   Implementors should pay particular attention to robustness when
>   parsing the /robots.txt file. Web site administrators who are not
>   aware of the /robots.txt mechanism often notice repeated failing
>   requests for it in their log files, and react by putting up pages
>   asking "What are you looking for?".
>
> As the majority of /robots.txt files are created with platform-
> specific text editors, robots should be liberal in accepting files
> with different end-of-line conventions, specifically CR and LF in
> addition to CRLF.
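>
>   For instance (informal, not normative), a parser can be made
>   indifferent to the end-of-line convention by normalising first:
>
>      def split_lines(text):
>          # Accept CRLF, bare CR and bare LF line endings alike.
>          return text.replace("\r\n", "\n").replace("\r", "\n").split("\n")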
>
>
>5. Security Considerations
>
> There are a few risks in the method described here, which may affect
> either origin server or robot.
>
>   Web site administrators must realise this method is voluntary, and
>   is not sufficient to guarantee that some robots will not visit
>   restricted parts of the URL space. Failure to use proper
>   authentication or other restriction may result in exposure of
>   restricted information. It is even possible that the occurrence of
>   paths in the /robots.txt file may expose the existence of resources
>   not otherwise linked to on the site, which may aid people guessing
>   URLs.
>
>   Robots need to be aware that the amount of resources spent on
>   dealing with the /robots.txt file is a function of the file
>   contents, which are not under the control of the robot. To prevent
>   denial-of-service attacks, robots are therefore encouraged to place
>   limits on the resources spent on processing /robots.txt.
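>
>   For example (an informal illustration; the limit value is
>   arbitrary), a robot might cap how much of the resource it is
>   willing to read:
>
>      MAX_ROBOTS_TXT_BYTES = 64 * 1024       # arbitrary safety limit
>
>      def read_limited(resp):
>          # Read at most MAX_ROBOTS_TXT_BYTES of the /robots.txt body,
>          # so a hostile or broken server cannot exhaust the robot.
>          return resp.read(MAX_ROBOTS_TXT_BYTES).decode("latin-1")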
>
>
>6. Acknowledgements
>
>   The author would like to thank the subscribers to the robots
>   mailing list for their contributions to this specification.
>
>
>7. References
>
> [1] Koster, M., "A Standard for Robot Exclusion",
> http://info.webcrawler.com/mak/projects/robots/norobots.html,
> June 1994.
>
> [2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
> Transfer Protocol -- HTTP/1.0." RFC 1945, MIT/LCS, May 1996.
>
> [3] Postel, J., "Media Type Registration Procedure." RFC 1590,
> USC/ISI, March 1994.
>
> [4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
> Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
> University of Minnesota, December 1994.
>
> [5] Crocker, D., "Standard for the Format of ARPA Internet Text
> Messages", STD 11, RFC 822, UDEL, August 1982.
>
> [6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
> UC Irvine, June 1995.
>
>8. Author's Address
>
> Martijn Koster
> WebCrawler
> America Online
> 690 Fifth Street
> San Francisco
> CA 94107
>
> Phone: 415-3565431
> EMail: m.koster@webcrawler.com
>
>-- Martijn
>
>Email: m.koster@webcrawler.com
>WWW: http://info.webcrawler.com/mak/mak.html
>
>
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html