Robots: 2
Or something along those lines that tells the robot which level (or revision)
of the robots.txt standard the site supports. I very much doubt a version of
robots.txt with regular expressions (or wildcards) would mesh that well with
the current version.
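As a first cut, a robot might just scan for that line and fall back to the
original standard when it is missing. A minimal Python sketch, where the
"Robots:" field name and the default of 1 are only my assumptions:

def robots_version(text):
    # Return the declared robots.txt revision, defaulting to 1 (the
    # original standard) when no "Robots:" line is present.  Both the
    # field name and the default are assumptions, nothing agreed upon.
    for line in text.splitlines():
        name, _, value = line.partition(':')
        if name.strip().lower() == 'robots':
            try:
                return int(value.strip())
            except ValueError:
                return 1
    return 1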
But, with that said, some notions I've been playing with.
Given no robots.txt file, a robot can then access the entire site (I
believe that is the current policy). Given a robots.txt file of:
Robots: 2
Agent: *
Allow: *
A robot can access the entire site. Conversely,
Robots: 2
Agent: *
Disallow: *
A robot is verboten. Simple enough. But, given a pathological case like:
Robots: 2
Agent: *
Allow: *
Disallow: *
What to do? I'll get to what I believe should happen by way of a more
complicated example, like the following:
Robots: 2
Agent: *
Disallow: *.shtml
Disallow: /blackhole/* ; may need to remove this
Allow: /blackhole/index.html
Allow: /blackhole/info*
Disallow: /blackhole/info99*
Disallow: /blackhole/info8.html
And a web site with the following files:
/blackhole/info98.shtml
/blackhole/info99.html
/blackhole/index.html
/blackhole/info98.html
/blackhole/info99.gif
/blackhole/info8.html
/blackhole/page3.html
/blackhole/info/index.html
/blackhole/info/page1.html
/blackhole/info/page2.shtml
The robot can then construct two lists, one of Allow rules and another of
Disallow rules (a rough sketch of building them in code follows the lists):
Allowlist:     /blackhole/index.html     a1(e)
               /blackhole/info*          a2(r)

Disallowlist:  *.shtml                   d1(r)
               /blackhole/*              d2(r)
               /blackhole/info99*        d3(r)
               /blackhole/info8.html     d4(e)
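Here's that sketch, in Python; the fnmatch-style '*' wildcards and the helper
name are just my assumptions about how version 2 might end up looking:

def build_rule_lists(lines):
    # Split Allow:/Disallow: lines into two tagged lists.  Each rule
    # becomes (pattern, kind), where kind is 'e' for an explicit match
    # (no wildcard) or 'r' for a pattern match.  The ";" comment
    # handling follows the example record above; the rest is assumed.
    allows, disallows = [], []
    for line in lines:
        line = line.split(';', 1)[0]                # drop "; ..." comments
        field, _, value = line.partition(':')
        field, value = field.strip().lower(), value.strip()
        if not value:
            continue
        kind = 'r' if '*' in value else 'e'
        if field == 'allow':
            allows.append((value, kind))
        elif field == 'disallow':
            disallows.append((value, kind))
    return allows, disallows

Fed the Allow/Disallow lines of the record above, it returns a1-a2 and d1-d4
in the order shown.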
I've numbered the rules, and marked whether each is a regular expression (r)
or an explicit match (e). We then have a choice: apply the Allow rules first
and then the Disallow rules, or the Disallow rules followed by the Allow
rules. Example one:
Filter through Allowlist first:
/blackhole/info98.shtml        pass a2(r)
/blackhole/info99.html         pass a2(r)
/blackhole/index.html          pass a1(e) - exit
/blackhole/info98.html         pass a2(r)
/blackhole/info99.gif          pass a2(r)
/blackhole/info8.html          pass a2(r)
/blackhole/page3.html          fail - exit
/blackhole/info/index.html     pass a2(r)
/blackhole/info/page1.html     pass a2(r)
/blackhole/info/page2.shtml    pass a2(r)
I used the assumption that an explicit match (rule a1) has a higher
precedence, so the filter can exit as soon as an explicit match is found. The
rest (all matched by regular expressions) then go to the Disallow rules:
Filter through Disallowlist:
/blackhole/info98.shtml        fail d1(r)
/blackhole/info99.html         fail d3(r)
/blackhole/info98.html         fail d2(r) *
/blackhole/info99.gif          fail d3(r)
/blackhole/info8.html          fail d4(e)
/blackhole/info/index.html     fail d2(r) *
/blackhole/info/page1.html     fail d2(r) *
/blackhole/info/page2.shtml    fail d1(r)
Due to rule d2(r), the files marked with '*' fail, even though it may
appear they should be allowed. If rule d2 is removed, then these files will
be accepted (accepted meaning the robot can access said pages). Note that
/blackhole/info/page2.shtml fails rule d1 either way, so removing d2 does
not rescue it.
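For what it's worth, here is how I picture example one in code; a sketch
only, assuming fnmatch-style wildcards and rules tagged (pattern, kind) as in
the lists above:

from fnmatch import fnmatchcase

def rule_matches(path, rule):
    pattern, kind = rule
    return path == pattern if kind == 'e' else fnmatchcase(path, pattern)

def allowed_allow_first(path, allows, disallows):
    # Example one: filter through the Allowlist first.  An explicit
    # Allow match accepts the path outright (the "exit" above); no
    # Allow match at all rejects it; a wildcard Allow match merely
    # passes it on to the Disallowlist, where any match rejects it.
    if any(rule_matches(path, a) for a in allows if a[1] == 'e'):
        return True
    if any(rule_matches(path, a) for a in allows if a[1] == 'r'):
        return not any(rule_matches(path, d) for d in disallows)
    return False

With d2 removed, this accepts exactly the four files that survive the trace:
/blackhole/index.html, /blackhole/info98.html, /blackhole/info/index.html and
/blackhole/info/page1.html.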
Now, example two will run through the Disallow rules first. If rule d2(r)
is still there, nothing gets through. But if d2(r) were removed, you get
the following:
Filter through Disallowlist first:
/blackhole/info98.shtml        fail d1(r) - exit
/blackhole/info99.html         fail d3(r) - exit
/blackhole/index.html          pass
/blackhole/info98.html         pass
/blackhole/info99.gif          fail d3(r) - exit
/blackhole/info8.html          fail d4(e) - exit
/blackhole/page3.html          pass
/blackhole/info/index.html     pass
/blackhole/info/page1.html     pass
/blackhole/info/page2.shtml    fail d1(r) - exit
In this case, upon a failure we return immediately, regardless of the type of
rule (explicit or regular expression). The rest are passed on to the Allow
rules:
Filter through Allowlist:
/blackhole/index.html          pass a1(e)
/blackhole/info98.html         pass a2(r)
/blackhole/page3.html          fail
/blackhole/info/index.html     pass a2(r)
/blackhole/info/page1.html     pass a2(r)
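And the same thing for example two, with the Disallowlist consulted first
(same assumptions as the previous sketch):

from fnmatch import fnmatchcase

def rule_matches(path, rule):
    pattern, kind = rule
    return path == pattern if kind == 'e' else fnmatchcase(path, pattern)

def allowed_disallow_first(path, allows, disallows):
    # Example two: any Disallow match rejects the path outright,
    # explicit or wildcard alike; a survivor is accepted only if some
    # Allow rule (of either kind) matches it.
    if any(rule_matches(path, d) for d in disallows):
        return False
    return any(rule_matches(path, a) for a in allows)

With d2 removed it accepts the same four files as the Allow->Disallow sketch
above.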
If you compare the results of the two examples (assuming that, for both,
d2(r) is removed from the file), then the same files pass. So, given the
file:
Robots: 2
Agent: *
Allow: *
Disallow: *
No files will be allowed. If you use Allow->Disallow, then Allow will
accept all files and Disallow will reject all of them. If you use
Disallow->Allow, then Disallow still rejects all files and nothing ever gets
to Allow.
It may seem that one can now implement either order, but the following
rule set is ambiguous:
Robots: 2
Agent: *
Allow: /index.html
Disallow: *
If the robot implements Allow->Disallow then /index.html will be accepted.
But if it goes Disallow->Allow, it will be rejected. And then you have:
Robots: 2
Agent: *
Allow: /index*
Disallow: *
And again, nothing will get through. Which is why I differentiated
between explicit matches and regular expression matches. This would allow
the following acceptance table (highest precedence first):
Allow explicit
Disallow
Allow regular expression
And this will (unfortunately) force implementations to follow an
Allow->Disallow path. But it is probably the cleanest approach, with the
fewest surprises. There is currently one case left where no Allow rule and no
Disallow rule applies to certain files (in the example above, say,
/index.html). What to do in this case?
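For concreteness, here is a sketch of that acceptance table as a single
check; the wildcard handling is still only my assumption, and the unresolved
case falls out the bottom as None rather than being decided one way or the
other:

from fnmatch import fnmatchcase

def allowed(path, allows, disallows):
    # Acceptance order from the table above: explicit Allow first,
    # then Disallow (either kind), then wildcard Allow.  Rules are
    # (pattern, kind) pairs with kind 'e' or 'r'.  A path that matches
    # no rule at all returns None -- the open question above.
    if any(kind == 'e' and path == pat for pat, kind in allows):
        return True
    if any(path == pat if kind == 'e' else fnmatchcase(path, pat)
           for pat, kind in disallows):
        return False
    if any(kind == 'r' and fnmatchcase(path, pat) for pat, kind in allows):
        return True
    return None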
-spc (Just my two zorkmids worth)