RFC, draft 2 (was Re: RFC, draft 1)

Martijn Koster (m.koster@webcrawler.com)
Thu, 21 Nov 1996 12:54:16 -0800


At 11:29 AM 11/19/96, Martijn Koster wrote:

>OK, back to the text editor now....

I controlled myself, and didn't put any new features in.

http://info.webcrawler.com/mak/projects/robots/no-robots-rfc.txt

*** norobots-rfc-1.txt Thu Nov 21 09:35:32 1996
--- norobots-rfc-2.txt Thu Nov 21 12:48:43 1996
***************
*** 1,8 ****
- Draft version 1, Fri Nov 15 14:46:55 PST 1996
-

!

Network Working Group M. Koster
Request for Comments: NNNN WebCrawler
--- 1,16 ----

+ Draft version 2, Thu Nov 21 11:42:07 PST 1996
+ - new title to allow future expansion without rename
+ - extension mechanism (m.koster@webcrawler.com)
+ - regrouped/completed examples in 3.2 (dmckeon@swcp.com)
+ - default to Allow (keijil@microsoft.com)
+ - denial of service example (spc@armigeron.com)
+ - authentication warning (hardy@netscape.com)
+ - new User-agent text (h.b.furuseth@usit.uio.no)
+ - cosmetics

! Draft version 1, Fri Nov 15 14:46:55 PST 1996
! - original version

Network Working Group M. Koster
Request for Comments: NNNN WebCrawler
***************
*** 10,16 ****

! A Method for Robots Exclusion

Status of this Memo

--- 18,24 ----

! A Method for Web Robots Control

Status of this Memo

***************
*** 113,119 ****

If the server response indicates the resource does not exist (HTTP
Status Code 404), the robot can assume no instructions are
! available, and that access to the site is unrestricted.

Specific behaviors for other server responses are not required by
this specification, though the following behaviors are recommended:
--- 121,128 ----

If the server response indicates the resource does not exist (HTTP
Status Code 404), the robot can assume no instructions are
! available, and that access to the site is not restricted by
! /robots.txt.

Specific behaviors for other server responses are not required by
this specification, though the following behaviors are recommended:
***************
*** 144,150 ****
<Field> ":" <value>

In this memo we refer to lines with a Field "foo" as "foo lines".
!
The record starts with one or more User-agent lines, specifying
which robots the record applies to, followed by "Disallow" and
"Allow" instructions to that robot. For example:
--- 153,159 ----
<Field> ":" <value>

In this memo we refer to lines with a Field "foo" as "foo lines".
!
The record starts with one or more User-agent lines, specifying
which robots the record applies to, followed by "Disallow" and
"Allow" instructions to that robot. For example:
***************
*** 157,180 ****

These lines are discussed separately below.

! Comments are allowed anywhere in the file, and consist of a comment
! character '#' followed by the comment, terminated by the end-of-line.

3.2.1 The User-agent line

! The User-agent line indicates to which specific robots the record
! applies.
!
! The line either specifies a simple name for a robot, or "*",
! indicating this record is the default record for robots for which
! no explicit User-agent line can be found in any of the records.
!
! The choice of a name(s) the robot scans for needs to be simple,
! obvious and well documented. Robots should use the same name in the
! User-agent field of a HTTP request, minus version information. Note
! that the syntax for the token in the "/robots.txt" file is more
! restrictive than the product token syntax for the HTTP User-agent
! field.

The name comparisons are case-insensitive.

--- 166,197 ----

These lines are discussed separately below.

! Lines with Fields not explicitly specified by this specification
! may occur in the /robots.txt, allowing for future extension of the
! format. Consult the BNF for restrictions on the syntax of such
! extensions. Note specifically that for backwards compatibility
! with robots implementing earlier versions of this specification,
! breaking of lines is not allowed.
!
! Comments are allowed anywhere in the file, and consist of optional
! whitespace, followed by a comment character '#' followed by the
! comment, terminated by the end-of-line.

3.2.1 The User-agent line

! Name tokens are used to allow robots to identify themselves via a
! simple product token. Name tokens should be short and to the
! point. The name token a robot chooses for itself should be sent
! as part of the HTTP User-agent header, and must be well documented.
!
! These name tokens are used in User-agent lines in /robots.txt to
! identify to which specific robots the record applies. The robot
! must obey the first record in /robots.txt that contains a User-
! Agent line whose value contains the name token of the robot as a
! substring. If no such record exists, it should obey the first
! record with a User-agent line with a "*" value, if present. If no
! record satisfies either condition, or no records are present at
! all, access is unlimited.

The name comparisons are case-insensitive.
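[Editorial illustration, not part of the draft: the record-selection
rule above, sketched in Python under the assumption that records have
already been parsed into (user-agent-values, rules) pairs in file
order.]

```python
def select_record(records, name_token):
    """Pick the record a robot must obey, per the rule above.

    records: list of (user_agent_values, rules) pairs, in file order.
    Returns None when no record applies (access is then unlimited).
    """
    token = name_token.lower()
    # First record whose User-agent value contains the robot's name
    # token as a case-insensitive substring.
    for agents, rules in records:
        if any(token in value.lower() for value in agents):
            return rules
    # Otherwise, the first record with a "*" User-agent value, if any.
    for agents, rules in records:
        if "*" in agents:
            return rules
    return None
```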

***************
*** 187,195 ****
might scan the "/robots.txt" file for records with:

User-agent: figtree
-
- Where possible, robots should specify the name(s) they scan for in
- included documentation.

3.2.2 The Allow and Disallow lines

--- 204,209 ----
***************
*** 197,205 ****
corresponding path is allowed or disallowed. Note that these
instructions apply to any HTTP method on a URL.

! To evaluate if a URL is allowed, a robot must attempt to match
! the paths in Allow and Disallow lines against the URL, in the order
! they occur in the record. The first match found is used.

The matching process compares every octet in the path portion of
the URL and the path from the record. If a %xx encoded octet is
--- 211,220 ----
corresponding path is allowed or disallowed. Note that these
instructions apply to any HTTP method on a URL.

! To evaluate if access to a URL is allowed, a robot must attempt to
! match the paths in Allow and Disallow lines against the URL, in the
! order they occur in the record. The first match found is used. If no
! match is found, the default assumption is that the URL is allowed.

The matching process compares every octet in the path portion of
the URL and the path from the record. If a %xx encoded octet is
***************
*** 217,229 ****
/tmp/ /tmp no
/tmp/ /tmp/ yes
/tmp/ /tmp/a.html yes
/a%3cd.html /a%3cd.html yes
/a%3Cd.html /a%3cd.html yes
/a%3cd.html /a%3Cd.html yes
! /a%3cd.html /a%3cd.html yes
/a%2fb.html /a%2fb.html yes
- /a%2fb.html /a%2Fb.html yes
/a%2fb.html /a/b.html no
/%7ejoe/index.html /~joe/index.html yes
/~joe/index.html /%7Ejoe/index.html yes

--- 232,248 ----
/tmp/ /tmp no
/tmp/ /tmp/ yes
/tmp/ /tmp/a.html yes
+
/a%3cd.html /a%3cd.html yes
/a%3Cd.html /a%3cd.html yes
/a%3cd.html /a%3Cd.html yes
! /a%3Cd.html /a%3Cd.html yes
!
/a%2fb.html /a%2fb.html yes
/a%2fb.html /a/b.html no
+ /a/b.html /a%2fb.html no
+ /a/b.html /a/b.html yes
+
/%7ejoe/index.html /~joe/index.html yes
/~joe/index.html /%7Ejoe/index.html yes
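[Editorial illustration, not part of the draft: the octet-comparison
rule, including the special treatment of the encoded slash "%2F", can
be sketched as follows and checked against the table above.]

```python
import re

def _decode(path):
    # Decode %xx escapes, but keep an encoded slash ("%2F") distinct
    # from a literal "/", as the matching rule above requires.
    def repl(m):
        octet = int(m.group(1), 16)
        return "%2F" if octet == 0x2F else chr(octet)
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)

def path_matches(record_path, url_path):
    """True if an Allow/Disallow path matches the URL path."""
    return _decode(url_path).startswith(_decode(record_path))
```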

***************
*** 247,268 ****
CRLF = CR LF
record = *commentline agentline *(commentline | agentline)
1*ruleline *(commentline | ruleline)
! agentline = "User-agent:" *space agent *space [comment] CRLF
! ruleline = (disallowline | allowline)
! disallowline = "Disallow:" *space path *space [comment] CRLF
! allowline = "Allow:" *space rpath *space [comment] CRLF
commentline = comment CRLF
! comment = "#" anychar
space = 1*(SP | HT)
rpath = "/" path
agent = token
anychar = <any CHAR except CR or LF>
CHAR = <any US-ASCII character (octets 0 - 127)>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>

The syntax for "path" is defined in RFC 1808, reproduced here for
convenience:

--- 266,303 ----
CRLF = CR LF
record = *commentline agentline *(commentline | agentline)
1*ruleline *(commentline | ruleline)
!
! agentline = "User-agent:" *space agent [comment] CRLF
! ruleline = (disallowline | allowline | extension)
! disallowline = "Disallow" ":" *space path [comment] CRLF
! allowline = "Allow" ":" *space rpath [comment] CRLF
! extension = token ":" *space value [comment] CRLF
! value = <any CHAR except CR or LF or "#">
!
commentline = comment CRLF
! comment = *space "#" anychar
space = 1*(SP | HT)
rpath = "/" path
agent = token
anychar = <any CHAR except CR or LF>
CHAR = <any US-ASCII character (octets 0 - 127)>
+ CTL = <any US-ASCII control character
+ (octets 0 - 31) and DEL (127)>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>

+ The syntax for "token" is taken from RFC 1945, reproduced here for
+ convenience:
+
+ token = 1*<any CHAR except CTLs or tspecials>
+
+ tspecials = "(" | ")" | "<" | ">" | "@"
+ | "," | ";" | ":" | "\" | <">
+ | "/" | "[" | "]" | "?" | "="
+ | "{" | "}" | SP | HT
+
The syntax for "path" is defined in RFC 1808, reproduced here for
convenience:

***************
*** 292,306 ****
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","

- The syntax for "token" is taken from RFC 1945, reproduced here for
- convenience:
-
- token = 1*<any CHAR except CTLs or tspecials>
-
- tspecials = "(" | ")" | "<" | ">" | "@"
- | "," | ";" | ":" | "\" | <">
- | "/" | "[" | "]" | "?" | "="
- | "{" | "}" | SP | HT

3.4 Expiration

--- 327,332 ----
***************
*** 362,371 ****

Robots need to be aware that the amount of resources spent on dealing
with the /robots.txt is a function of the file contents, which is not
! under the control of the robot. To prevent denial-of-service attacks,
! robots are therefore encouraged to place limits on the resources
! spent on processing of /robots.txt.
!

6. Acknowledgements

--- 388,404 ----

Robots need to be aware that the amount of resources spent on dealing
with the /robots.txt is a function of the file contents, which is not
! under the control of the robot. For example, the contents may be
! larger than the robot can handle. To prevent denial-of-
! service attacks, robots are therefore encouraged to place limits on
! the resources spent on processing of /robots.txt.
!
! The /robots.txt directives are retrieved and applied in separate,
! possibly unauthenticated HTTP transactions, and it is possible that
! one server can impersonate another or otherwise intercept a
! /robots.txt, and provide a robot with false information. This
! specification does not preclude authentication and encryption
! from being employed to increase security.
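[Editorial illustration, not part of the draft: the resource-limiting
advice can be sketched as below. The byte and line caps are arbitrary
values chosen for this sketch, not figures from the specification.]

```python
MAX_BYTES = 64 * 1024   # arbitrary caps chosen for this sketch
MAX_LINES = 1000

def bounded_robots_lines(raw):
    """Truncate an untrusted /robots.txt body before parsing, so a
    hostile or pathological file cannot exhaust the robot's resources."""
    text = raw[:MAX_BYTES].decode("iso-8859-1", "replace")
    return text.splitlines()[:MAX_LINES]
```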

6. Acknowledgements

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html