Re: robots.txt syntax

Captain Napalm (spc@armigeron.com)
Sat, 12 Oct 1996 15:40:30 -0400 (EDT)


From some obscure corner of the Matrix, Martijn Koster was seen transmitting:
>
> At 7:15 AM 10/11/96, Fred K. Lenherr wrote:
>
> >Disallow: /
> >Allow: /public/
>
> So I was half-way through finally implementing this in C for WebCrawler,
> when it occurred to me that we may have a problem (beyond my hard disk
> crashing at that very point, groan :-/).
>
> The obvious way to have the above work is follow the longest matching rule,
> and pick one (Disallow) if we see the same path in both an Allow and a
> Disallow. And of course you need to take special care of %%'s
>
> But, then how do we later add wildcard support? For example, what does the
> following mean?:
>
> url = /foo.shtml
> Allow: /foo
> Disallow: *shtml
>
> Using the length of the matching rule doesn't really make sense here.
> We could make conflicting rules without wildcards have precedence over
> rules containing wildcards, or even generalise to saying that a rule
> with x wildcards has preference over a rule with y wildcards for x<y.
> Hmmm... obscure, but probably the least surprise in the simple case.
> No idea what to do if we were to allow full regexps.
>
> Or we could make the order of the rules significant, and use the first
> matching rule. Sort of reminds me of the NCSA httpd config file. It would
> be a very clear method, with probably more expressive power (regexps
> would be just fine). Also a bit obscure (eg in Fred's example the last
> rule would never be matched :-), but very simple to explain.
>
> So, what do you all think? Other arguments against/in favour of any of
> the above? Other solutions?
>
First off, I would propose a new name for the new format, something like
nrobots.txt, AND that it contain

Robots: 2

Or something that tells the robot what level (or revision) of the
robots.txt standard the file conforms to. I very much doubt a version of
robots.txt with regular expressions (or wildcards) would mesh that well
with the current version.
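
As a rough sketch of what I mean (assuming the version field is literally
the first line of the file; the function name here is just for
illustration), a robot could check the revision before bothering to parse
anything else:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the revision claimed on the first line of nrobots.txt, or 1
 * if there is no "Robots:" line (i.e. treat it as the current
 * standard).  A robot that only understands revision 1 could then
 * skip a file it doesn't know how to parse. */
int robots_revision(FILE *fp)
{
  char line[256];

  if (fgets(line, sizeof(line), fp) != NULL
      && strncmp(line, "Robots:", 7) == 0)
  {
    return atoi(line + 7);
  }
  return 1;
}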

But, with that said, some notions I've been playing with.

Given no nrobots.txt file, a robot can then access the entire site (I
believe that is the current policy). Given an nrobots.txt file of:

Robots: 2
Agent: *
Allow: *

A robot can access the entire site. Conversely,

Robots: 2
Agent: *
Disallow: *

A robot is verboten. Simple enough. But, given a pathological case like:

Robots: 2
Agent: *
Allow: *
Disallow: *

What to do? I'll get to what I believe should happen by using a more
complicated example, like the following:

Robots: 2
Agent: *
Disallow: *.shtml
Disallow: /blackhole/* ; may need to remove this
Allow: /blackhole/index.html
Allow: /blackhole/info*
Disallow: /blackhole/info99*
Disallow: /blackhole/info8.html

And a web site with the following files:

/blackhole/info98.shtml
/blackhole/info99.html
/blackhole/index.html
/blackhole/info98.html
/blackhole/info99.gif
/blackhole/info8.html
/blackhole/page3.html
/blackhole/info/index.html
/blackhole/info/page1.html
/blackhole/info/page2.shtml

The robot can then construct two lists, one of Allow rules and another of
Disallow rules:

Allowlist:     /blackhole/index.html    a1(e)
               /blackhole/info*         a2(r)

Disallowlist:  *.shtml                  d1(r)
               /blackhole/*             d2(r)
               /blackhole/info99*       d3(r)
               /blackhole/info8.html    d4(e)

I've numbered the rules, and marked whether they are regular expressions
(r) or explicit matches (e). We then have a choice: apply the Allow rules
first and then the Disallow rules, or the Disallow rules followed by the
Allow rules. Example one:

Filter through Allowlist first:
/blackhole/info98.shtml      pass a2(r)
/blackhole/info99.html       pass a2(r)
/blackhole/index.html        pass a1(e) - exit
/blackhole/info98.html       pass a2(r)
/blackhole/info99.gif        pass a2(r)
/blackhole/info8.html        pass a2(r)
/blackhole/page3.html        fail - exit
/blackhole/info/index.html   pass a2(r)
/blackhole/info/page1.html   pass a2(r)
/blackhole/info/page2.shtml  pass a2(r)

I used the assumption that an explicit match (rule a1) has higher
precedence, so the filter can exit as soon as an explicit match is found.
The remaining files (all of which passed via regular expression rules) then
go through the Disallow rules:

Filter through Disallowlist:
/blackhole/info98.shtml      fail d1(r)
/blackhole/info99.html       fail d3(r)
/blackhole/info98.html       fail d2(r) *
/blackhole/info99.gif        fail d3(r)
/blackhole/info8.html        fail d4(e)
/blackhole/info/index.html   fail d2(r) *
/blackhole/info/page1.html   fail d2(r) *
/blackhole/info/page2.shtml  fail d1(r)

Due to rule d2(r), the files marked with '*' fail, even though it may
appear they should be allowed. If rule d2 is removed, then these files will
be accepted (meaning the robot can access those pages).
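
As a rough sketch (not working code from anything, and assuming '*' is the
only wildcard we support), this Allow->Disallow pass could look something
like the following in C, with fnmatch() doing the wildcard matching and a
rule containing no '*' treated as an explicit match; the rule arrays are
assumed to have been built while parsing nrobots.txt:

#include <fnmatch.h>
#include <string.h>

/* A rule without '*' is an explicit match (e); anything else is
 * treated as a wildcard pattern (r). */
static int rule_matches(const char *rule, const char *path)
{
  if (strchr(rule, '*') == NULL)
    return strcmp(rule, path) == 0;
  return fnmatch(rule, path, 0) == 0;
}

/* Allow->Disallow order: an explicit Allow accepts the path outright,
 * a path matching no Allow rule at all is rejected, and otherwise any
 * matching Disallow rule rejects it.  Returns nonzero if the robot
 * may fetch the path. */
int allowed(const char *path, char **allow, int nallow,
                              char **disallow, int ndis)
{
  int i;
  int hit = 0;

  for (i = 0; i < nallow; i++)
  {
    if (rule_matches(allow[i], path))
    {
      if (strchr(allow[i], '*') == NULL)
        return 1;                        /* explicit Allow - exit */
      hit = 1;                           /* wildcard Allow        */
    }
  }

  if (!hit)
    return 0;                            /* failed the Allow list */

  for (i = 0; i < ndis; i++)
  {
    if (rule_matches(disallow[i], path))
      return 0;                          /* any Disallow rejects  */
  }

  return 1;
}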

Now, example two will run through the Disallow rules first. If rule d2(r)
is still there, nothing gets through. But if d2(r) were removed, you get
the following:

Filter through Disallowlist first:
/blackhole/info98.shtml      fail d1(r) - exit
/blackhole/info99.html       fail d3(r) - exit
/blackhole/index.html        pass
/blackhole/info98.html       pass
/blackhole/info99.gif        fail d3(r) - exit
/blackhole/info8.html        fail d4(e) - exit
/blackhole/page3.html        pass
/blackhole/info/index.html   pass
/blackhole/info/page1.html   pass
/blackhole/info/page2.shtml  fail d1(r) - exit

In this case, upon a failure we exit immediately, regardless of the type of
rule (explicit or regular expression). The rest are passed on to the Allow
rules:

Filter through Allow list:
/blackhole/index.html        pass a1(e)
/blackhole/info98.html       pass a2(r)
/blackhole/page3.html        fail
/blackhole/info/index.html   pass a2(r)
/blackhole/info/page1.html   pass a2(r)

If you compare the results of the two examples (and assume that for both,
d2(r) is removed from the file) then the same files pass. So, given the
file:

Robots: 2
Agent: *
Allow: *
Disallow: *

No files will be allowed. If you use Allow->Disallow, then Allow will
accept all files, and Disallow will reject them all. If you use
Disallow->Allow, then Disallow still rejects all files and nothing ever
gets to Allow.

It may now seem that one can implement either order, but the following
rule set is ambiguous:

Robots: 2
Agent: *
Allow: /index.html
Disallow: *

If the robot implements Allow->Disallow then /index.html will be accepted.
But if it goes Disallow->Allow, it will be rejected. And then you have:

Robots: 2
Agent: *
Allow: /index*
Disallow: *

And again, nothing will get through. Which is why I differentiated
between explicit matches and regular expression matches. This would allow
the following acceptance table:

Allow explicit
Disallow
Allow regular expression

And this will (unfortunately) force implementations into an Allow->Disallow
order. But this is probably the cleanest approach, with the fewest
surprises. There is still one case left: a file that matches no Allow rule
and no Disallow rule at all (in the /blackhole/ example above, say,
/index.html). What to do in this case?
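
To make that table concrete, here is a sketch of the acceptance order in C
(again assuming '*' is the only wildcard, with fnmatch() standing in for
the matching); the NO_RULE case is exactly the open question above:

#include <fnmatch.h>
#include <string.h>

enum verdict { ACCEPT, REJECT, NO_RULE };

/* Acceptance order from the table above: explicit Allow rules first,
 * then all Disallow rules, then wildcard Allow rules.  A path that
 * matches no rule at all comes back as NO_RULE. */
enum verdict check(const char *path, char **allow, int nallow,
                                     char **disallow, int ndis)
{
  int i;

  for (i = 0; i < nallow; i++)            /* 1. Allow explicit     */
  {
    if (strchr(allow[i], '*') == NULL && strcmp(allow[i], path) == 0)
      return ACCEPT;
  }

  for (i = 0; i < ndis; i++)              /* 2. Disallow           */
  {
    if ((strchr(disallow[i], '*') == NULL)
          ? strcmp(disallow[i], path) == 0
          : fnmatch(disallow[i], path, 0) == 0)
      return REJECT;
  }

  for (i = 0; i < nallow; i++)            /* 3. Allow regexp       */
  {
    if (strchr(allow[i], '*') != NULL && fnmatch(allow[i], path, 0) == 0)
      return ACCEPT;
  }

  return NO_RULE;
}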

-spc (Just my two zorkmids worth)