I'd like to propose a slight modification to what I proposed
earlier.
The current format of robots.txt is:
# comment - can appear almost anywhere
User-agent: agentname # comment
Disallow: simplepattern
User-agent: * # match all agents not explicitly mentioned
Disallow: /
The simple pattern is basically equivalent to the following regular
expression:
simplepattern*
i.e. a literal match followed by anything.
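For example (a rough sketch in Python, just to pin the semantics down),
checking a URL path against a 1.0.0 pattern could look like:

import re

def matches_1_0_0(simplepattern, url_path):
    # A 1.0.0 pattern is a literal prefix; anything after the literal
    # part is accepted, i.e. "simplepattern" behaves like the regular
    # expression "simplepattern.*" anchored at the start.
    return re.match(re.escape(simplepattern), url_path) is not None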
This would be considered Version 1.0.0 of the robots.txt convention, which
all robots would have to follow. The new extensions would start with a
version of 2.0.0 and can be included in existing 1.0.0 robots.txt files
(assuming the parser follows the conventions and ignores any unknown
headers). It would work as follows:
# -------------------------------------------
# the following robot only understands 1.0.0
# -------------------------------------------
User-agent: Fredsbot # if it doesn't understand 2.0.0
Disallow: / # it's unwelcome here
# ---------------------------------------------
# the following robot does understand 2.0.0
# ---------------------------------------------
User-agent: Chives
Robot-version: 2.0.0
Allow: */index.html # only allow index pages
# ----------------------------------------------
# So does this one, but we know it works, so we
# allow a bit more freedom for this one
# ----------------------------------------------
User-agent: Alfred
Robot-version: 2.0.0
Allow: /blackhole/index.html
Allow: /blackhole/info*
Disallow: *.shtml
Disallow: /blackhole/info99*
Disallow: /blackhole/info8.html
# ------------------------------------------------
# anything else we don't trust
# -------------------------------------------------
User-agent: *
Disallow: /
First off, if the next valid header after 'User-agent' is NOT
'Robot-version', then treat the entry as a version 1.0.0 robots.txt entry
(the current standard). Even if a robot does understand 2.0.0, it should
treat the match string as if a '*' were added to the end (which is how it
currently works).
If the next valid header after 'User-agent' IS 'Robot-version' we then can
read the version string. It CAN contain a version of '1.0.0' (which should
be ignored by any robots that don't understand the header).
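To pin that rule down, here's a rough sketch (Python, given a hypothetical
ordered list of (name, value) header pairs for one rule set) of how a
robot would decide which version a rule set follows:

def rule_set_version(headers):
    # headers: ordered (name, value) pairs for one rule set, starting
    # with the 'User-agent:' header(s).
    for name, value in headers:
        if name.lower() == 'user-agent':
            continue              # skip the User-agent header(s)
        if name.lower() == 'robot-version':
            return value          # may be '2.0.0', or even '1.0.0'
        return '1.0.0'            # next valid header isn't Robot-version
    return '1.0.0'                # nothing after User-agent at all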
In my proposed 2.0.0 robots.txt standard, there are three forms of
matching strings, one of which exists for 1.0.0 compatibility. Explicit
matches are those which do not contain any wildcard/regular expression
characters. Regular expression matches are those that do contain
wildcard/regular expression characters. General matches do NOT contain
wildcard/regular expression characters, but are treated as if a '*' (which
matches zero or more characters) were appended to the end of the string.
They exist solely for 1.0.0 compatibility, and the context in which
they're used is easily determined.
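A sketch of telling the three forms apart (Python again; I'm assuming '*'
is the only wildcard/regular expression character for the moment):

def classify(pattern, version):
    # '*' matches zero or more characters; extend this test if more
    # regular expression characters get adopted.
    if '*' in pattern:
        return 'regex'
    if version == '1.0.0':
        return 'general'      # treated as if a '*' were appended
    return 'explicit'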
The following tags are used in defining the data for each header:
<general> - the 1.0.0 general match string format
<explicit> - the 2.0.0 explicit match string format
<regex> - the 2.0.0 wildcard/regular expression match string format
<version> - the version of the robots.txt standard (see
Robot-version: for more details)
Unless otherwise stated, more than one header can appear in a rule set.
The following headers are defined (or redefined) for 2.0.0:
User-agent:
Format:
User-agent: <general> # 1.0.0 , 2.0.0
Comment:
This is the same format as the 1.0.0 version, with the
added note that the default rule set should follow the
1.0.0 standard format of:
User-agent: *
Disallow: /
Robot-version:
Format:
Robot-version: <version> # 2.0.0
Comment:
The version is a three part number, with each part
separated by a period.
The first part is the major version number of the
robots.txt standard. Only drastic changes to the
standard shall cause this number to be increased.
Valid numbers for this part are 1 and 2.
The second part is for partial upgrades, clarifications
or small added extensions. My intent is to follow the
Linux Kernel numbering convention here and have even
numbers be 'stable' (or agreed upon) standards, and odd
numbers to be 'experimental', with possible differing
interpretations of headers.
The final number is a revision of the current major and
minor numbers. It is hoped that this number will be 0
for 'even' versions of the robots.txt standard.
This will follow the 'User-agent:' header. If it does
not immediately follow, or is missing, then the robot
is to assume the rule set follows the 1.0.0 standard.
Only ONE 'Robot-version:' header per rule set is allowed.
A version number of 1.0.0 is allowed.
When checking the version number, a robot can assume (if the
second part is even) that a higher version number than
it's looking for is okay (i.e. if a robot is looking for
version 2.0.0 and comes across 2.2.0, then it can still
use the rule set).
If a robot comes across a lower version number, then it
will have to correctly parse the headers according to
that version.
A robot, if it comes across an experimental version number,
should probably ignore that rule set and use the default.
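A sketch of that check, assuming a robot that understands up to 2.0.0:

def version_decision(found, understood=(2, 0, 0)):
    major, minor, rev = (int(part) for part in found.split('.'))
    if minor % 2 != 0:
        return 'ignore'             # experimental: use the default rule set
    if (major, minor, rev) > understood:
        return 'use'                # higher, but even minor, so still okay
    return 'parse as ' + found      # equal or lower: parse per that version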
Allow:
Format:
Allow: <explicit> # 2.0.0
Allow: <regex> # 2.0.0
Comments:
An explicit match string has the highest precedence and
grants the robot the explicit permission to retrieve the
URL stated.
A wildcard/regular expression match has the lowest precedence
and grants the robot permission to retrieve matching URLs
ONLY if no disallow rule filters out the URL (see
Disallow:).
If there are no disallow rules, then the robot is ONLY
allowed to retrieve the URLs that match the explicit
and/or wildcard/regular expressions given.
Disallow:
Format:
Disallow: <general> # 1.0.0
Disallow: <explicit> # 2.0.0
Disallow: <regex> # 2.0.0
Comment:
[1.0.0] See the current robots.txt standard for the 1.0.0 behavior,
except to note that a general match can be turned into a
wild card/regular expression match by adding a '*' to the
end of the string.
[2.0.0] Any URL matching the explicit match or the wild
card/regular expression is NOT to be retrieved.
If there are no allow rules, then any URL not matching the
rule(s) can be retrieved by the robot.
If there are allow rules, then explicit allows have a higher
precedence than a disallow rule. Disallow rules have a
higher precedence than wild card/regular expression rules.
Any URL not matching the disallow rules then has to pass
(any) wild card/regular expression rules. If there are
no wild card/regular expression rules, then
we have a choice here:
1. nothing else is allowed
2. everything else is allowed
Discuss.
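Putting the Allow:/Disallow: precedence together, here's a sketch of the
decision (Python, with '*' assumed to be the only wildcard; for the open
question above I've arbitrarily picked choice 1):

import re

def _regex(pattern):
    # '*' matches zero or more characters; everything else is literal.
    return '.*'.join(re.escape(part) for part in pattern.split('*'))

def matches(kind, pattern, url):
    if kind == 'explicit':
        return pattern == url
    return re.fullmatch(_regex(pattern), url) is not None

def may_fetch(url, allows, disallows):
    # allows/disallows are lists of (kind, pattern) pairs, kind being
    # 'explicit' or 'regex'.
    if any(k == 'explicit' and matches(k, p, url) for k, p in allows):
        return True               # explicit allow: highest precedence
    if any(matches(k, p, url) for k, p in disallows):
        return False              # disallows beat wildcard allows
    if not allows:
        return True               # no allow rules: anything not disallowed is okay
    regex_allows = [(k, p) for k, p in allows if k == 'regex']
    if not regex_allows:
        return False              # choice 1 above ("nothing else is allowed")
    return any(matches(k, p, url) for k, p in regex_allows)

With the 'Alfred' rule set above, for example, this gives False for
/blackhole/info8.html and True for /blackhole/info7.html.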
Now, for some headers that might be useful, but can wait:
Visit-time:
Format:
Visit-time: <time> <time> # 2.0.0
Comment:
The robot is requested to only visit the site between the
given times. If the robot visits outside of this time,
it should notify its author/user that the site only
wants it between the times specified.
This can only appear once per rule set.
The format for <time> has to be worked out. At the
least, use GMT (or whatever it is called now).
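For what it's worth, a sketch of the check (Python; I'm guessing at HHMM
in GMT/UTC for <time>, since the format still has to be worked out):

from datetime import datetime, timezone

def within_visit_time(start='0100', end='0430'):
    # start/end are HHMM strings in GMT/UTC -- an assumption, not part
    # of the proposal yet.
    now = datetime.now(timezone.utc).strftime('%H%M')
    if start <= end:
        return start <= now <= end
    return now >= start or now <= end    # window wraps past midnight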
Request-rate:
Format:
Request-rate: <rate> # 2.0.0
Request-rate: <rate> <time> <time> # 2.0.0
Comment:
<rate> is defined as <numdocuments> '/' <timeunit>. The
default time unit is the second.
<time> format is to be worked out.
An example of some rates:
10/60 - no more than 10 documents per 60 secs
10/10m - no more than 10 documents per 10 mins
20/1h - no more than 20 documents per hour
If the times are given, then the robot is to use the given
rate (and no faster) when the current time is between the times given.
If more than one 'Request-rate:' header is given and does
NOT include the time, use the one that requests the
fewest documents per unit of time.
If no 'Request-rate:' is given, then the robot is encouraged
to use the following rule of thumb for time between
requests (which seems reasonable to me):
twice the amount of time it took to retrieve
the document
10 seconds
whichever is slower.
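A sketch of parsing <rate> (Python, with the 'm' and 'h' suffixes taken
from the examples above; anything else is taken as seconds):

def parse_rate(rate):
    # <rate> is <numdocuments> '/' <timeunit>; the default unit is the
    # second.
    docs, _, unit = rate.partition('/')
    unit = unit or '1'
    multiplier = 1
    if unit[-1] in 'mh':
        multiplier = 60 if unit[-1] == 'm' else 3600
        unit = unit[:-1]
    return int(docs), int(unit or 1) * multiplier    # '10/10m' -> (10, 600)

The rule of thumb above then works out to max(2 * fetch_time, 10) seconds
between requests.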
Comment:
Format:
Comment: <string to EOLN> # 2.0.0
Comment:
These are comments that the robot is encouraged to send
back to the author/user of the robot. All 'Comment:'s
in a rule set are to be sent back (at least, that's the
intention). This can be used to explain the robot policy
of a site (say, that one government site that hates
robots).
Oh, gee ... will ya look at the time. Anyway, enough goofing off from
work. Sigh.
-spc (It seems reasonable to me at least)