Re: robots.txt syntax

Captain Napalm (spc@armigeron.com)
Mon, 14 Oct 1996 12:01:54 -0400 (EDT)


>From some obscure corner of the Matrix, John D. Pritchard was seen transmitting:
>
>
> > > if disallow rules take precedence, everything's simple. if both rule types
> > > are matching, then the request is disallowed. i don't think this impedes
> > > the expressiveness of the resulting rule language. the rule language then
> > > has disallow as an absolute disallow rather than a fuzzy, implementation
> > > dependent disallow. :-)
> > >
> > Well, the reason I had explicit allows higher than disallow was for the
> > following case:
> >
> > Allow: /index.html
> > Disallow: *
> >
> > Such that the only file allowed would be /index.html and nothing else.
> > After that, disallow rules take precedence. Something that hasn't been
> > clarified (by me or anyone else) is the following:
> >
> > Allow: /index.html
> >
> > Does that mean ONLY /index.html is allowed? In general, if something is
> > neither explicitly allowed nor explicitly disallowed, is it allowed?
> >
> well, there is a subtle problem which arises when we're not sure about when
> a rule is a URL ("absolute" URL) and when it's a URI (general request), as
> in my system. In other words, whether or not a URI is "absolute", a URL,
> or only a URI is not *defined* by any standard in terms of pattern
> matching, e.g., the method you suggest... regexp "/[a-zA-Z/]*\.html".
>
Which is why, in my proposal, I suggested we go to another file entirely,
which I perhaps didn't make clear. Under the current robots.txt standard,
the line:

Disallow: /foobar

is equivalent to:

Disallow: /foobar*

The rationale (I suppose) is that it's easier to implement via strncmp()
(in C). The problem is, given that, how do we introduce regular
expressions (or even simple wildcards) into robots.txt? Which is why I
proposed a new file for the standard.

> URLs' ("absolute" URLs') rules may look like "/[a-zA-Z/]*\.html" but they
> could also be "/[a-zA-Z/]*" so that the "absolute case" you refer to, the
> URL case in this context, could be either
>
> /path/filename
>
> or
>
> /path/filename.html
>
> on my system (wwweb server).
>
In my case, "/[a-zA-Z/]*\.html" and "/[a-zA-Z/]*" are both regular
expressions and NOT explicit. Explicit references would NOT contain
regular expressions.

-spc (Or, treat everything as regular expressions, with Disallow having
a higher precedence than Allow )