Re: robots.txt syntax

John D. Pritchard (jdp@cs.columbia.edu)
Mon, 14 Oct 1996 02:58:18 -0400


napalm,

that's way too complicated...

if disallow rules take precedence, everything's simple. if both rule types
match, then the request is disallowed. i don't think this impedes the
expressiveness of the resulting rule language. the rule language then has
disallow as an absolute disallow rather than a fuzzy, implementation-dependent
disallow. :-)
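The "disallow is absolute" rule above can be sketched in a few lines. This is a minimal illustration (Python is my choice here, not anything from the thread; parsing and %-escape handling are omitted):

```python
# Hypothetical sketch of the "Disallow is absolute" rule: if any Disallow
# prefix matches, the URL is off limits, whatever the Allow lines say.

def is_allowed(path, allows, disallows):
    """allows/disallows: lists of path prefixes from a robots.txt record."""
    if any(path.startswith(prefix) for prefix in disallows):
        return False  # a matching Disallow always wins
    return True  # otherwise allowed; Allow lines never need to override anything

# Note the consequence for Fred's example below: with "Disallow: /" and
# "Allow: /public/", everything is disallowed, since "/" matches every path.
```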

-john

> From some obscure corner of the Matrix, Martijn Koster was seen transmitting:
> >
> > At 7:15 AM 10/11/96, Fred K. Lenherr wrote:
> >
> > >Disallow: /
> > >Allow: /public/
> >
> > So I was half-way through finally implementing this in C for WebCrawler,
> > when it occurred to me that we may have a problem (beyond my hard disk
> > crashing at that very point, groan :-/).
> >
> > The obvious way to have the above work is to follow the longest matching
> > rule, and pick one (Disallow) if we see the same path in both an Allow
> > and a Disallow. And of course you need to take special care of %-escapes.
> >
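The longest-match scheme quoted above, with Disallow winning ties, can be sketched as follows (a hypothetical Python illustration, not WebCrawler's C code; %-escape handling omitted):

```python
# Sketch of longest-match precedence: the longest matching prefix decides,
# and on a tie between an Allow and a Disallow of equal length, Disallow wins.

def is_allowed(path, rules):
    """rules: list of (kind, prefix) pairs, kind is 'Allow' or 'Disallow'."""
    best_len, best_allow = -1, True  # no match at all means allowed
    for kind, prefix in rules:
        if path.startswith(prefix):
            is_allow = (kind == 'Allow')
            # longer match wins; equal length: Disallow beats Allow
            if len(prefix) > best_len or (len(prefix) == best_len and not is_allow):
                best_len, best_allow = len(prefix), is_allow
    return best_allow

rules = [('Disallow', '/'), ('Allow', '/public/')]
# '/public/faq.html' matches both rules, but the Allow prefix is longer
```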
> > But, then how do we later add wildcard support? For example, what does the
> > following mean?:
> >
> > url = /foo.shtml
> > Allow: /foo
> > Disallow: *shtml
> >
> > Using the length of the matching rule doesn't really make sense here.
> > We could make conflicting rules without wildcards have precedence over
> > rules containing wildcards, or even generalise to saying that a rule
> > with x wildcards has preference over a rule with y wildcards for x<y.
> > Hmmm... obscure, but probably the least surprise in the simple case.
> > No idea what to do if we were to allow full regexps.
> >
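The "fewer wildcards take precedence" idea floated above can also be sketched. This is purely illustrative (it treats patterns as prefixes by appending `*` before matching, and breaks wildcard-count ties by rule order; none of that is specified in the thread):

```python
import fnmatch

# Hypothetical sketch: among all matching rules, the one with the fewest
# '*' wildcards decides; ties go to the earlier rule in the file.

def is_allowed(path, rules):
    """rules: list of (kind, pattern) pairs; patterns may contain '*'."""
    matches = [(pattern.count('*'), i, kind)
               for i, (kind, pattern) in enumerate(rules)
               if fnmatch.fnmatchcase(path, pattern + '*')]  # prefix semantics
    if not matches:
        return True
    _, _, kind = min(matches)  # fewest wildcards wins
    return kind == 'Allow'

rules = [('Allow', '/foo'), ('Disallow', '*shtml')]
# '/foo.shtml' matches both, but '/foo' has no wildcards, so Allow wins
```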
> > Or we could make the order of the rules significant, and use the first
> > matching rule. This sort of reminds me of the NCSA httpd config file. It would
> > be a very clear method, with probably more expressive power (regexps
> > would be just fine). Also a bit obscure (eg in Fred's example the last
> > rule would never be matched :-), but very simple to explain.
> >
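The order-significant alternative quoted above is the simplest to write down. A minimal sketch (again a hypothetical Python illustration):

```python
# Sketch of first-match semantics: scan rules top to bottom, NCSA-style,
# and let the first matching prefix decide.

def is_allowed(path, rules):
    """rules: list of (kind, prefix) pairs in robots.txt order."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == 'Allow'
    return True  # nothing matched: allowed by default

# With Fred's ordering the Allow line is unreachable, since '/' matches
# every path first; swapping the two lines makes /public/ fetchable again.
rules = [('Disallow', '/'), ('Allow', '/public/')]
```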
> > So, what do you all think? Other arguments against/in favour of any of
> > the above? Other solutions?
> >
> First off, I would propose a new name for the new format, something like
> nrobots.txt, AND that it contain
>
> Robots: 2