Re: An extended version of the Robots...

Captain Napalm (spc@armigeron.com)
Sat, 9 Nov 1996 02:32:19 -0500 (EST)


You know, I post this to the list, I get almost no critical feedback. I
make it a web page, I get critical feedback. 8-)

Anyway ...

It was thus said that the Great Martijn Koster once stated:
>
> spc@armigeron.com, Captain Napalm, really Sean Conner wrote:
>
> > take a look at http://www.armigeron.com/people/spc/robots2.html
> > and give me feedback.
>
> I meant to give feedback before, and still haven't found the time to
> go really deep, but here are my initial reactions. It's my usual feeling:
> this is too complex.
>

> Specifically:
>
> - the version field
>
> I don't think we need a version field, at least not yet; as existing
> robots ignore the new fields, we need any new scheme to be backwards
> compatible. This means we cannot change the semantics of old files.

I think I covered that. If no version is found in a rule set, the robot
has to assume the old semantics for that rule set.

The current standard only has two directives: 'User-agent:' and
'Disallow:'. Any other directives are ignored. If you are going to allow
regexs in 'Disallow:', you have to have a way to tell the robot which format
it is dealing with: the old standard or the new one.

A robot using the old standard (named 1.0.0 in my document) coming across
the new standard is not going to work. If you are concerned about this,
then you have to state that 'Disallow:' maintains the current semantics (the
poor partial match one) and that there are now two new fields:

Retrievable:
Non-retrievable:

Since 'Disallow:' cannot contain regexs (if you are concerned with
compatibility), 'Allow:' can't be used either: giving 'Allow:' a new
format might give the impression that 'Disallow:' supports the same
semantics.
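
To make that concrete, a rule set along those lines might read something
like this (the paths are just illustrative):

User-agent: foobot
Disallow: /tmp                   # old partial-match semantics; a 1.0.0
                                 # robot sees only this line
Non-retrievable: /cgi-bin/.*     # new fields, where regexs are allowed
Retrievable: /index\.html$

A 1.0.0 robot simply ignores the two new fields and honours the 'Disallow:'
line as it always has.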

> This then means we can only add new fields, with new semantics for
> them. The mere occurrence of these fields indicates they should be
> honoured. There is no two-way information exchange where you need
> to negotiate the highest common version of a protocol. So you don't
> need a version. Asking people to understand multiple versions of a
> protocol may put them off.
>
True enough. But what happens when you want to extend the functionality
of an existing directive (like 'Disallow:' - see above)?

> Ah. Found a backwards compatibility problem: You say:
>
You either have such problems, or say to hell with backwards compatibility
and break a lot of existing products.

> | If there are no disallow rules, then the robot is only allowed to
> | retrieve the URLs that match the explicit and/or regular
> | expressions given.
>
> This means that an 1.0.0 robot, on seeing:
>
> User-agent: foo
> Version: 2.0.0
> Allow: /foo
>
> will say "/bar is not disallowed, thus approved", whereas a 2.0.0 robot
> would say "No disallow, no 'Allow: /bar', thus denied". Hmmm, don't
> like that...
>
Good point. But I would say that when adding a rule set for a robot where
you do not know whether it supports the new standard, you should use the
1.0.0 standard unless you find out otherwise.

Or we create a secondary file (say, robots2.txt) with completely new
semantics. I proposed that at first, but rejected it, seeing how robots.txt
is hardly being used as it is.
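
To make the first option concrete (robot names made up): a robot of unknown
vintage gets plain 1.0.0 rules, and only a robot known to understand the new
standard gets a versioned, Allow-based rule set.

User-agent: unknownbot          # capabilities unknown: 1.0.0 rules only
Disallow: /private
Disallow: /tmp

User-agent: foobot              # known to understand the new standard
Robot-version: 2.0.0
Allow: /index\.html$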

> - the version field format
>
> While I personally like the Linux convention, I don't believe it is
> very applicable to standards: three decimals encourage too much
> playing, and the idea of "unstable standards" doesn't make a lot
> of sense. It would make more sense to use the standards x/y as in
> HTTP. If you need a version number, which I don't think we do.
>
I think we need version numbers (and using the HTTP-style version format is
a good idea), if only to indicate new functionality (ROBOT/2.0 allows
regexs, ROBOT/1.0 doesn't).
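
So the rule set would carry the version as an HTTP-style token, something
like this (the exact field name is still up in the air):

User-agent: foobot
Robot-version: ROBOT/2.0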

> - precedence of general vs explicit vs regexp
>
> Having different precedence levels is one way to evaluate the rules.
> Your document would benefit from an explicit pseudo-code algorithm
> to do the evaluation.

Okay:

if (Version 1)
{
  if (url matches "Disallow: <general>" directive)
    return DISALLOW;
  else
    return ALLOW;
}
else if (Version 2)
{
  if (url matches "Allow: <explicit>" directive[s])
    return ALLOW;
  else if (url matches "Disallow:" directive[s])
    return DISALLOW;
  else if (url matches "Allow: <regex>" directive[s])
    return ALLOW;
  else if (allow rules && no disallow rules)
    return DISALLOW;
  else
    return ALLOW;
}

This covers everything. If there is an allow rule set and no disallow
rule set, then anything not allowed is disallowed. If there are disallow
rules and no allow rules, anything not disallowed is allowed. And if there
are both, and nothing matches, then the URL is allowed.

Maybe I didn't explain it well enough. I think I will include this
pseudocode on the page.
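
For the curious, here is roughly the same thing as runnable Python. The
parsed rule lists are hypothetical (the draft doesn't fix how parsing
happens), and treating <explicit> patterns as exact path matches is my own
assumption:

import re

def allowed(url, version, rules):
    # `rules` is a dict of already-parsed directives, e.g.
    # {"disallow": [...], "allow_explicit": [...], "allow_regex": [...],
    #  "disallow_explicit": [...], "disallow_regex": [...]}
    if version == 1:
        # 1.0.0 semantics: partial (prefix) match against Disallow: lines.
        return not any(url.startswith(d) for d in rules.get("disallow", []))

    # 2.0.0 semantics, in order of precedence:
    if url in rules.get("allow_explicit", []):            # Allow: <explicit>
        return True
    if any(url == d for d in rules.get("disallow_explicit", [])) or \
       any(re.search(p, url) for p in rules.get("disallow_regex", [])):
        return False                                       # Disallow:
    if any(re.search(p, url) for p in rules.get("allow_regex", [])):
        return True                                        # Allow: <regex>

    has_allow = rules.get("allow_explicit") or rules.get("allow_regex")
    has_disallow = (rules.get("disallow_explicit") or
                    rules.get("disallow_regex"))
    if has_allow and not has_disallow:
        return False   # only allow rules: anything not allowed is disallowed
    return True        # otherwise anything not disallowed is allowed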

> Do I understand it would be:
>
> if (Version 1) {
>   if (url matches "Disallow: <general>" directive) {
>     return DISALLOW;
>   }
>   else {
>     return ALLOW;
>   }
> }
Okay

> else if (Version 2) {
>   if (url matches "Allow: <explicit>" directive) {
>     return ALLOW
>   }
>   if (url matches "Disallow: <general>" directive or

I probably didn't state this clearly. 'Disallow: <general>' only applies
to 1.0.0 (otherwise, how would you know whether you have a <general> or an
<explicit> pattern?).

>       url matches "Disallow: <explicit>" directive or
>       url matches "Disallow: <regexp>" directive) {
>     return DISALLOW
>   }
>   if (there are no "Disallow: " directives) {
>     if (url matches "Allow: <regexp>") {
>       return ALLOW
>     }
>     else {
>       return DISALLOW
>     }
>   }
>   else {
>     return ALLOW
>   }
> }
> else {
>   return ALLOW;
> }
>
> Did I get this right?
>
Pretty much, but a bit verbose.

> It seems to be a lot more complicated than for example the "evaluate in
> order" strategy I proposed initially:
>
> for $rule (@rules) {
>   if ($url matches $rule)
>     if (rule says Disallow)
>       return Disallow
>     else if (rule says Allow)
>       return Allow
> }
>
> Note that you need no version numbers, it is still backwards
> compatible, and very easy to explain.
> Does your complexity buy you any expressive power?
>
What if the url doesn't match any rules? Say I have:

User-agent: foobot
Robot-version: 2.0.0 # ROBOT/2.0 anyone?
Allow: /index.html
Allow: /info.html

What is disallowed? Under my scheme, only these two files will be allowed
(and I do have a site like that: only one or two documents are "safe" for a
robot to pick up). Under your scheme, I might try:

User-agent: foobot
Allow: /index.html
Allow: /info.html
Disallow: *

But what has precedence? You have disallows higher than allows. This
isn't what I want, and we're more or less back to what we have now (long
lists of what isn't allowed).

> - Visit-time
>
> Why restrict it to one occurrence per ruleset? I might want robots
> at 01:00-07:00, 20:00-23:00, which you could have as two Visit-times,
> or even by changing the format like I did here (I like the '-' to
> indicate a range)
>
I was thinking of that when typing up the document, but decided to keep it
as is (from my initial posting). It seems reasonable to allow more than one
'Visit-time:' directive. Unless anyone else has anything to say against
allowing more than one, I'll amend it.

Adding the '-' seems okay. But how to specify more than one visit time?
A comma-separated list, like you have? Or a separate directive for each one?
I'd be inclined to allow both myself, but if I had to pick, I would pick a
separate directive for each visit time.
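
Just so we're looking at the same thing, the two forms would be something
like this (exact time format aside):

# one directive per window (the form I lean toward):
Visit-time: 01:00-07:00
Visit-time: 20:00-23:00

# or a comma-separated list on one line:
Visit-time: 01:00-07:00, 20:00-23:00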

> - Request-rate
>
> | If more than one Request-rate: directive is given and does not
> | include the time, use the one that requests the fewest documents
> | per unit of time.
>
> Hmmm... maybe this is just my reading, but does this say what you
> mean? Given 2/60s and 1/1s the '1' is the fewest documents per time
> unit, but 2/60 is the fewest documents per normalised time units.
>
Okay, I think I'll amend the document to say fewest documents per
normalized time unit.

> What to do when you are faced with multiple overlapping timed
> rates?
>
I would say that the one that requests the fewest documents per normalized
unit of time is the safer choice.

> If you say 100/24h, does that mean I can do 100 retrievals
> simultaneously?
>
No, 100 over a period of 24h, spread equally (in this case, 1440 minutes /
100 = one document roughly every 14 minutes).
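
To pin down what I mean by "normalized": convert each rate to documents per
second, honour the smallest, and space requests evenly at the inverse of
that rate. A rough Python sketch (parsing of the directive itself is left
out, and the unit letters are my own reading):

from fractions import Fraction

# seconds per unit letter, as I'd read "s", "m", "h" in a Request-rate value
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

def docs_per_second(docs, amount, unit):
    return Fraction(docs, amount * UNIT_SECONDS[unit])

# "2/60s" vs "1/1s": 1/30 doc/s vs 1 doc/s, so honour 2/60s.
safest = min(docs_per_second(2, 60, "s"), docs_per_second(1, 1, "s"))

# "100/24h": 100/86400 doc/s, i.e. one request every 864 seconds
# (about 14.4 minutes).
gap_seconds = 1 / docs_per_second(100, 24, "h")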

> - Perl regular expressions
>
Did I say Perl? <checks document> I did say Perl. How about that.

The reason I said Perl is that it's used quite extensively for Web work;
otherwise I would have to go and explicitly define which regexprs to use.

> This makes implementation outside Perl a near impossibility.
> Normal v7 regexps are far easier to deal with.
>
What would that be? What /bin/sh uses? Tcsh? grep? Perl? Do you use
'?' or '.' to match a single character? What about one or more characters?
This may be impossible to do correctly, given the number of different
regexpr flavors and meta-characters that are possible (I like the set used
in AmigaDOS myself, but it's hardly used outside of the Amiga world).

I made a pragmatic choice. Maybe not a good one, but it's one that is out
there and very commonly used (I don't much like Perl myself).
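
To illustrate the ambiguity (the path is made up): say you want to block
/a1.html, /a2.html, and so on, but not /a.html.

Disallow: /a..html   # "." matches exactly one character in a grep/Perl regexp
Disallow: /a?.html   # "?" matches one character in a shell glob, but means
                     # "zero or one of the preceding" in a Perl regexp

Without pinning down which flavor is meant, two robots reading the same
line can do different things.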

> That's all for now, I hope to get a bit more time for this soon.
> Incidentally, I find the silence from the big robot search
> companies deafening... Does this indicate a lack of interest/commitment
> or is everyone just too busy?
>
I hope everyone is just too busy.

-spc (Maybe a way to get robots.txt on more servers is to have an existing
web server make mention of it, like Apache ... )