Re: robots.txt syntax

John D. Pritchard (jdp@cs.columbia.edu)
Tue, 15 Oct 1996 07:05:30 -0400


> >> >Disallow: /
> >> >Allow: /public/
>
> >in a line, if any Disallow rule matches the request url, then it is no
> >good.
>
> So in the above try /public/index.html would be disallowed?
> That's not good enough -- we need that functionality.

that's not a match... gimme a break! "/*" would've been a match.

RobotExclusionFile := {Rule }+

Rule := Type, ": ", Variable

if we combine the idea (i think spc's) of differentiating between regexp
Variables and non-regexp Variables with the Disallow-precedence idea, we
can arrive at a more sophisticated robots exclusion language.

Type := "Allow" | "Disallow"

Variable := Pathfile | Pathfile-Regexp

Pathfile: a literal path, front-slash directory separator and the usual
limited character set.

Pathfile-Regexp: regular expression syntax and semantics as found in awk,
matched against target Pathfile strings.

Disallows take precedence: when one matches, it excludes the request URI
from robot use, given that some rule variables will be regular expressions.
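
To make the precedence concrete, here is a minimal sketch (Python, with its
re module standing in for awk-style regexps) of how a robot might parse and
apply such a file. The function names are mine, and two points the grammar
leaves open are handled by assumption: a Variable containing regexp
metacharacters is treated as a Pathfile-Regexp, and a request URI that
matches no rule at all falls back to the traditional robots.txt default of
being allowed.

import re

# assumption: a Variable containing any of these characters is taken to be a
# Pathfile-Regexp; the grammar does not say how the two forms are told apart
REGEXP_METACHARS = set("*+?[](){}|^$.\\")

def parse_rules(robots_txt):
    # RobotExclusionFile := { Rule }+ , Rule := Type, ": ", Variable
    rules = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rule_type, _, variable = line.partition(":")
        rule_type, variable = rule_type.strip(), variable.strip()
        if rule_type in ("Allow", "Disallow") and variable:
            rules.append((rule_type, variable))
    return rules

def matches(variable, path):
    # plain Pathfile: exact match only
    # Pathfile-Regexp: unanchored search, as an awk pattern would be
    # (Python's re is an approximation of awk's ERE semantics)
    if any(c in REGEXP_METACHARS for c in variable):
        return re.search(variable, path) is not None
    return variable == path

def allowed(rules, path):
    # Disallows take precedence: any matching Disallow excludes the path
    for rule_type, variable in rules:
        if rule_type == "Disallow" and matches(variable, path):
            return False
    # no Disallow matched; an Allow match, or no match at all (assumed
    # default), leaves the path open to the robot
    return True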

If a robots.txt has

Disallow: /

then this means only that a request of the form

GET /

is refused, because the rule was explicit. If it says

Disallow: /*

then this means that the site is verboten for robots.
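
Feeding those two one-line files through the sketch above, under its stated
assumptions, reproduces exactly that behavior:

rules = parse_rules("Disallow: /")
allowed(rules, "/")                    # False: literal rule, exact match
allowed(rules, "/public/index.html")   # True:  "/" alone is not a match

rules = parse_rules("Disallow: /*")
allowed(rules, "/public/index.html")   # False: the regexp "/*" matches any path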

-john