Re: An extended version of the Robots...

Martijn Koster ((no email))
Fri, 8 Nov 96 17:13:17 -0800


spc@armigeron.com, Captain Napalm, really Sean Conner wrote:

> take a look at http://www.armigeron.com/people/spc/robots2.html
> and give me feedback.

I meant to give feedback before, and still haven't found the time to
go really deep, but here are my initial reactions. It's my usual feeling:
this is too complex.

Specifically:

- the version field

I don't think we need a version field, at least not yet: since existing
robots simply ignore new fields, any new scheme has to be backwards
compatible anyway. This means we cannot change the semantics of old files,
which in turn means we can only add new fields, with new semantics for
them. The mere occurrence of these fields indicates they should be
honoured. There is no two-way information exchange where you need
to negotiate the highest common version of a protocol, so you don't
need a version. Asking people to understand multiple versions of a
protocol may put them off.
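
For what it's worth, the mechanism behind that compatibility is trivial:
a 1.0-style parser only acts on the field names it knows and skips the
rest. A rough sketch (illustrative only, not anyone's actual code):

# Sketch of a 1.0-style /robots.txt parser: unknown field names are
# silently skipped, so adding new fields cannot break existing robots.
my (%disallow, $agent);
while (<STDIN>) {
    s/#.*//;                                        # strip comments
    next unless /^\s*([A-Za-z-]+)\s*:\s*(.*?)\s*$/;
    my ($field, $value) = (lc $1, $2);
    if    ($field eq 'user-agent') { $agent = $value }
    elsif ($field eq 'disallow')   { push @{$disallow{$agent}}, $value if defined $agent }
    # Allow, Version, Visit-time, ... fall through here and are ignored
}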

Ah. Found a backwards compatibility problem: You say:

| If there are no disallow rules, then the robot is only allowed to
| retrieve the URLs that match the explicit and/or regular
| expressions given.

This means that a 1.0.0 robot, on seeing:

User-agent: foo
Version: 2.0.0
Allow: /foo

will say "/bar is not disallowed, thus approved", whereas a 2.0.0 robot
would say "No disallow, no 'Allow: /bar', thus denied". Hmmm, don't
like that...
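
To put numbers on it, here is a rough sketch (Perl, purely illustrative)
of the two readings applied to that record for the URL /bar:

# The record above: no Disallow lines, one "Allow: /foo" line.
my @allow    = ('/foo');
my @disallow = ();
my $url      = '/bar';

# 1.0.0 reading: only Disallow lines matter; anything not disallowed is allowed.
my $old = (grep { index($url, $_) == 0 } @disallow) ? 'DISALLOW' : 'ALLOW';

# 2.0.0 reading (as I read the draft): with no Disallow lines present,
# only URLs matching an Allow line may be retrieved.
my $new = (grep { index($url, $_) == 0 } @allow) ? 'ALLOW' : 'DISALLOW';

print "1.0.0 robot: $old   2.0.0 robot: $new\n";    # ALLOW vs DISALLOW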

- the version field format

While I personally like the Linux convention, I don't believe it is
very applicable to standards: a three-part version number encourages
too much playing around, and the idea of "unstable standards" doesn't
make a lot of sense. It would make more sense to use the standard x.y
form, as in HTTP. That is, if you need a version number at all, which
I don't think we do.

- precedence of general vs explicit vs regexp

Having different precedence levels is one way to evaluate the rules.
Your document would benefit from an explicit pseudo-code algorithm
for doing the evaluation. Do I understand correctly that it would be:

if (Version 1) {
    if (url matches "Disallow: <general>" directive) {
        return DISALLOW;
    }
    else {
        return ALLOW;
    }
}
else if (Version 2) {
    if (url matches "Allow: <explicit>" directive) {
        return ALLOW;
    }
    if (url matches "Disallow: <general>" directive or
        url matches "Disallow: <explicit>" directive or
        url matches "Disallow: <regexp>" directive) {
        return DISALLOW;
    }
    if (there are no "Disallow: " directives) {
        if (url matches "Allow: <regexp>") {
            return ALLOW;
        }
        else {
            return DISALLOW;
        }
    }
    else {
        return ALLOW;
    }
}
else {
    return ALLOW;
}

Did I get this right?

It seems to be a lot more complicated than for example the "evaluate in
order" strategy I proposed initially:

for $rule (@rules) {
    if ($url matches $rule) {
        if (rule says Disallow) {
            return Disallow;
        }
        else if (rule says Allow) {
            return Allow;
        }
    }
}

Note that you need no version numbers, it is still backwards
compatible, and it is very easy to explain.
Does your complexity buy you any expressive power?
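
Fleshed out slightly, it is just a first-match-wins loop; a sketch only
(the rule representation here is made up for illustration):

# Sketch of "evaluate in order": rules are kept in file order as
# [type, path-prefix] pairs; the first one whose prefix matches wins,
# and a URL matching no rule at all is allowed.
sub check_url {
    my ($url, @rules) = @_;
    for my $rule (@rules) {
        my ($type, $prefix) = @$rule;
        return $type if index($url, $prefix) == 0;   # 'Allow' or 'Disallow'
    }
    return 'Allow';
}

# e.g. check_url('/tmp/foo.html', ['Allow', '/tmp/ok'], ['Disallow', '/tmp'])
# gives 'Disallow'; check_url('/tmp/ok/x.html', same rules) gives 'Allow'.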

- Visit-time

Why restrict it to one occurrence per ruleset? I might want robots
at 01:00-07:00 and 20:00-23:00, which you could express as two
Visit-time lines, or even by changing the format as I did here (I like
the '-' to indicate a range).
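
For example (an illustrative record only; the '-' range notation is my
suggestion, not the draft's):

User-agent: *
Visit-time: 01:00-07:00
Visit-time: 20:00-23:00
Disallow: /cgi-bin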

Note, incidentally, that this one would be very low on my list
of priorities to implement, compared to e.g. Allow and regexps.

- Request-rate

| If more than one Request-rate: directive is given and does not
| include the time, use the one that requests the fewest documents
| per unit of time.

Hmmm... maybe this is just my reading, but does this say what you
mean? Given 2/60s and 1/1s, the '1' is the fewest documents per stated
time unit, but 2/60s is the lower rate once normalised to a common unit.
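
You can only compare them sensibly after normalising; a sketch (assuming
an n/t format with an optional s/m/h suffix, which is my guess at the
syntax):

# Sketch: convert "n/t" Request-rate values (optional s/m/h unit on t)
# to documents per second so they can be compared directly.
my %seconds = (s => 1, m => 60, h => 3600);
sub docs_per_second {
    my ($rate) = @_;
    my ($docs, $time, $unit) = $rate =~ m{^(\d+)/(\d+)([smh]?)$}
        or return undef;
    return $docs / ($time * ($seconds{$unit} || 1));
}
printf "%g vs %g docs/sec\n", docs_per_second('2/60s'), docs_per_second('1/1s');
# prints roughly 0.0333 vs 1: "fewest documents" and "lowest rate" point
# at different directives depending on whether you normalise.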

What to do when you are faced with multiple overlapping timed
rates?

If you say 100/24h, does that mean I can do 100 retrievals
simultaneously?

Again, for me this one would be very low on my list
of priorities to implement, compared to e.g. Allow and regexps.

- Perl regular expressions

This makes implementation outside Perl a near impossibility.
Normal v7 regexps are far easier to deal with.

That's all for now; I hope to get a bit more time for this soon.
Incidentally, I find the silence from the big robot search
companies deafening... Does this indicate a lack of interest/commitment
or is everyone just too busy?