Re: An extended version of the Robots...

Hallvard B Furuseth (h.b.furuseth@usit.uio.no)
Mon, 11 Nov 1996 16:21:45 +0100 (MET)


MK === Martijn Koster wrote:

MK> this is too complex.
MK> (...)
MK> - Visit-time
MK> - Request-rate
MK>
MK> (...) would be very much low on my list of priorities to implement,

Divide the fields into obligatory "must implement" fields and advisory
"may implement" fields. Visit-time and Request-rate would be advisory.

- The version field:

MK> I don't think we need a version field, at least not yet;

Quite so. Not *yet*. But we may regret version 2's lack of a version
field when we are trying to define version 5. I vote to add the field
now, and either let it be purely informational (for now), or -- in order
to enforce its use -- let a versionless robots.txt mean version 1.
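
Concretely, a versionless file like

User-agent: *
Disallow: /tmp

would be read with version 1 semantics, while a file that wants version
2 semantics says so explicitly (the Regexp-Disallow line is only an
illustration of a possible version 2 field):

User-agent: *
Version: 2.0.0
Regexp-Disallow: ^/tmp/.*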

SPC === Captain Napalm (don't you have a real name?) wrote:

SPC> Since 'Disallow:' cannot contain regexs (if you are concerned
SPC> with compatibility), and therefore, 'Allow:' can't be used since
SPC> using a new format for 'Allow:' might give the impression that
SPC> 'Disallow:' allows the same semantics.

Good point. Let's ban the field name `Allow:'.

MK> This then means we can only add new fields, with new semantics for
MK> them.

SPC> what happens when you want to extend the functionality of an
SPC> existing directive (like 'Disallow:' - see above)?

As MK said, we can't. If we want new functionality, we must put it in
new fields, like your (Non-)?Retrievable: fields.

Step by step: The new standard version defines a new field name with the
new functionality. It goes on to tell high-version robots to ignore the
old field, to give it low precedence, or to process it just like before,
or whatever is appropriate. Older robots are not affected.
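
For example (field and path names only for illustration), a new-style
file can carry the old and the new field side by side:

User-agent: *
Disallow: /data                    # old robots obey only this line
Regexp-Disallow: ^/data/.*\.ps$    # unknown to old robots, so ignored

A new robot combines the two however the new standard tells it to; an
old robot never notices the new field.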

MK> Ah. Found a backwards compatibility problem: You say:
MK>
SPC> You either have them, or say to hell with backwards compatibility
SPC> and break a lot of existing products.

No, you design the new standard so it does not conflict with the old one.

MK> an 1.0.0 robot, on seeing:
MK>
MK> User-agent: foo
MK> Version: 2.0.0
MK> Allow: /foo
MK>
MK> will say "/bar is not disallowed, thus approved", whereas a 2.0.0 robot
MK> would say "No disallow, no 'Allow: /bar', thus denied". Hmmm, don't
MK> like that...
SPC>
SPC> Good point. But I would say that adding a rule set for a new
SPC> robot where you do not know if it supports the new standard, you
SPC> should use the 1.0.0 standard unless you find out otherwise.

No. Then a robots.txt file's default rule set (User-agent: *) would have
to assume the robot does not understand version 2. That would make
version 2 pretty useless.

SPC> Or we create a secondary file (say, robots2.txt) and have
SPC> completely new semantics.

Yes, that's another way. It provides the same functionality as new
field names (which old robots will ignore) in version 2 of the standard.
(Imagine that "Disallow:" is translated to "robots.txt-Disallow:" or
"robots2.txt-Disallow:" depending on the file name.)

I prefer a single robots.txt; fetching just one file requires fewer HTTP
requests.

- precedence of general vs explicit vs regexp

MK> It seems to be a lot more complicated than for example the "evaluate in
MK> order" strategy I proposed initially:

Agreed. Also, MK's "evaluate in order" is more general than precedence
rules. "Evaluate in order" allows:

Regexp-Disallow: ^[LUMS](.*,)?o=Universitetet_i_Oslo,c=NO
Regexp-Allow: ^[LUMS](.*,)?c=NO
Regexp-Disallow: ^[LUMS].*
Regexp-Allow: ^.*

I can't express that with SPC's precedence rules. OTOH, "evaluate in
order" can express everything precedence rules can. We just write the
fields in robots.txt in the order SPC's algorithm would apply them:
first the Explicit-Allow fields, then Disallow, then Regexp-Allow, then
optionally Disallow: *.
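
Written out, with SPC's field names as I read them and made-up paths:

User-agent: *
Explicit-Allow: /docs/index.html   # explicit allows first
Disallow: /docs                    # then plain disallows
Regexp-Allow: ^/pub/.*\.html$      # then regexp allows
Disallow: *                        # optionally, the catch-all last

Evaluated top to bottom, first match wins, this gives the same answers
as SPC's precedence rules.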

> What if the url doesn't match any rules? Say I have:

Define that the fields for a User-Agent are followed by an implicit
'Allow: *', just like in version 1.
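
That is (paths made up), given

User-agent: *
Disallow: /private

a URL like /docs/paper.html matches no field, falls through to the
implicit 'Allow: *', and is approved, which is the same answer a
version 1 robot would give.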

Regards,

Hallvard