Re: robots.txt syntax

John D. Pritchard (jdp@cs.columbia.edu)
Mon, 14 Oct 1996 12:17:25 -0400


> > if disallow rules take precedence, everything's simple. if both rule types
> > match, then the request is disallowed. i don't think this impedes the
> > expressiveness of the resulting rule language. the rule language then has
> > disallow as an absolute disallow rather than a fuzzy,
> > implementation-dependent disallow. :-)
> >
> Well, the reason I ranked explicit allows above disallows was for the
> following case:
>
> Allow: /index.html
> Disallow: *
>
> Such that the only file allowed would be /index.html and nothing else.
> After that, disallow rules take precedence. Something that hasn't been
> clarified (by me or anyone else) is the following:
>
> Allow: /index.html
>
> Does that mean ONLY /index.html is allowed? In general, if a URL isn't
> explicitly allowed and isn't explicitly disallowed, is it allowed?
>
> -spc (And nothing in my scheme would disallow explicit disallows ... )
>

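For concreteness, here is a minimal sketch of the allow-first ordering
described above (Python; the fnmatch-style patterns and the allow-by-default
fallback are my assumptions, since that default is exactly the open
question):

    import fnmatch

    def is_allowed(url, allows, disallows):
        # explicit allows take precedence, per the scheme quoted above
        if any(fnmatch.fnmatch(url, p) for p in allows):
            return True
        # then disallows; "*" here shuts out everything else
        if any(fnmatch.fnmatch(url, p) for p in disallows):
            return False
        # unmatched URLs: assumed allowed -- the unresolved case
        return True

    # is_allowed("/index.html", ["/index.html"], ["*"]) -> True
    # is_allowed("/other.html", ["/index.html"], ["*"]) -> False
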
well, there is a subtle problem that arises when we can't tell whether a
rule names a URL (an "absolute" URL) or a URI (a general request), as on my
system. In other words, whether a URI is "absolute", a URL, or only a URI
is not *defined* by any standard in terms of pattern matching, e.g., the
method you suggest: the regexp "/[a-zA-Z/]*\.html".

Rules for URLs ("absolute" URLs) may look like "/[a-zA-Z/]*\.html", but
they could also be "/[a-zA-Z/]*", so that the "absolute case" you refer to,
the URL case in this context, could be either

/path/filename

or

/path/filename.html

on my system (wwweb server).
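
A quick illustration of the mismatch (Python; the patterns are the ones
above, and fullmatch requires the whole request URI to match):

    import re

    # the stricter pattern rejects the extensionless form ...
    bool(re.fullmatch(r"/[a-zA-Z/]*\.html", "/path/filename.html"))  # True
    bool(re.fullmatch(r"/[a-zA-Z/]*\.html", "/path/filename"))       # False

    # ... and the looser pattern rejects the ".html" form,
    # since "." is not in the character class
    bool(re.fullmatch(r"/[a-zA-Z/]*", "/path/filename"))             # True
    bool(re.fullmatch(r"/[a-zA-Z/]*", "/path/filename.html"))        # False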

so when one rule URI is indistinguishable from another for determining the
Dis/Allow status of a request URI, we simply have to adopt a tie-breaking
scheme.
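
One such scheme, sketched in Python (making Disallow win exact ties is my
reading of the suggestion at the top of this thread, not anything any
standard prescribes; "match" stands in for whatever pattern test the server
uses):

    def decide(uri, allows, disallows, match):
        allowed = any(match(uri, p) for p in allows)
        denied = any(match(uri, p) for p in disallows)
        if allowed and denied:   # both rule types match: the tie
            return False         # break it in favor of Disallow
        if denied:
            return False
        return True              # unmatched URIs: assumed allowed

Under this scheme an explicit Allow can never override a matching Disallow,
which is the expressiveness trade-off the allow-first ordering above was
meant to avoid.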

-john