Re: robots.txt syntax

Martijn Koster (m.koster@webcrawler.com)
Sat, 12 Oct 1996 11:09:07 -0700


At 7:15 AM 10/11/96, Fred K. Lenherr wrote:

>Disallow: /
>Allow: /public/

So I was half-way through finally implementing this in C for WebCrawler,
when it occurred to me that we may have a problem (beyond my hard disk
crashing at that very point, groan :-/).

The obvious way to make the above work is to follow the longest matching
rule, and to pick the Disallow if we see the same path in both an Allow
and a Disallow. And of course you need to take special care of %-escapes.
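
Concretely, something like this (a rough sketch, not WebCrawler's actual
code, and with the %-decoding assumed to have happened already):

#include <string.h>

struct rule {
    int allow;               /* 1 = Allow, 0 = Disallow */
    const char *path;        /* path prefix, already %-decoded */
};

/* Return 1 if path may be fetched under these rules, 0 if not. */
int allowed(const char *path, const struct rule *rules, int nrules)
{
    size_t best_len = 0;
    int best_allow = 1;      /* no matching rule: allowed */
    int i;

    for (i = 0; i < nrules; i++) {
        size_t len = strlen(rules[i].path);

        if (strncmp(path, rules[i].path, len) != 0)
            continue;        /* rule doesn't apply to this path */
        if (len > best_len || (len == best_len && !rules[i].allow)) {
            best_len = len;  /* longest match wins, Disallow wins ties */
            best_allow = rules[i].allow;
        }
    }
    return best_allow;
}

With Fred's lines that does what you'd hope: for /public/stuff the
Allow: /public/ rule is the longer match, and for everything else only
Disallow: / matches.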

But then how do we later add wildcard support? For example, what does
the following mean?

url = /foo.shtml
Allow: /foo
Disallow: *shtml

Using the length of the matching rule doesn't really make sense here.
We could make conflicting rules without wildcards take precedence over
rules containing wildcards, or even generalise that: a rule with x
wildcards takes precedence over a rule with y wildcards when x < y.
Hmmm... obscure, but probably the least surprise in the simple case.
No idea what to do if we were to allow full regexps.
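
The comparison itself wouldn't be hard; something like this (just an
illustration of the precedence idea, using '*' as the only wildcard):

struct rule { int allow; const char *path; };   /* as in the sketch above */

/* Number of '*' wildcards in a pattern. */
static int count_wildcards(const char *pat)
{
    int n = 0;

    for (; *pat != '\0'; pat++)
        if (*pat == '*')
            n++;
    return n;
}

/* Return 1 if rule a takes precedence over rule b; both are assumed
 * to match the URL path. */
static int precedes(const struct rule *a, const struct rule *b)
{
    int wa = count_wildcards(a->path);
    int wb = count_wildcards(b->path);

    if (wa != wb)
        return wa < wb;               /* fewer wildcards wins */
    return !a->allow && b->allow;     /* same count: Disallow wins */
}

Under that rule the example above comes out allowed: Allow: /foo has no
wildcards, so it beats Disallow: *shtml.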

Or we could make the order of the rules significant, and use the first
matching rule. Reminds me a bit of the NCSA httpd config file. It would
be a very clear method, with probably more expressive power (regexps
would be just fine). Also a bit obscure (e.g. in Fred's example the last
rule would never be matched :-), but very simple to explain.
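
In code it comes out very simply (again just a sketch, with '*' standing
in for whatever wildcard syntax we'd pick):

struct rule { int allow; const char *path; };   /* as in the sketch above */

/* Does pattern (where '*' matches any run of characters) match a
 * prefix of path? */
static int wild_match(const char *pat, const char *path)
{
    if (*pat == '\0')
        return 1;                     /* pattern used up: prefix matched */
    if (*pat == '*') {
        do {
            if (wild_match(pat + 1, path))
                return 1;
        } while (*path++ != '\0');
        return 0;
    }
    return *pat == *path && wild_match(pat + 1, path + 1);
}

/* The first matching rule decides; no match at all means allowed. */
int allowed_in_order(const char *path, const struct rule *rules, int nrules)
{
    int i;

    for (i = 0; i < nrules; i++)
        if (wild_match(rules[i].path, path))
            return rules[i].allow;
    return 1;
}

With the url = /foo.shtml example above, Allow: /foo is listed first and
matches, so the page gets fetched and the Disallow: *shtml line is never
consulted.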

Or we could ignore the problem altogether, and never add regexps :-)
I don't really think that's an option...

Sigh, the most expressive option would be to make a /robots.txt.{perl,
python, java bytecode, lisp} with an API the robot can call to find
out :-) Alan Kay is right, but that's not really an option either.

So, what do you all think? Other arguments against/in favour of any of
the above? Other solutions?

-- Martijn "fsck" Koster