Re: An extended version of the Robots...

Hallvard B Furuseth (h.b.furuseth@usit.uio.no)
Tue, 12 Nov 1996 10:41:30 +0100 (MET)


It was thus said that the Great Captain Napalm once stated:
>It was thus said that the Great Hallvard B Furuseth once stated:
>>
>> Divide the fields in obligatory "must implement" and advisory "may
>> implement" fields. Visit-time and Request-rate would be advisory.
>
> But if both Visit-time: and Request-rate: are advisory, what incentive
> do the various search engines have in implementing them, since they
> limit when and how fast the robot can work?

The same incentive as they have in implementing any of RES, which might
not be very much. Again, I think someone should provide a Perl library
and a C library which implement the entire RES; that should help
considerably.
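
To give an idea of what I mean, here is a minimal C sketch of the core
check such a library would offer for the current standard (prefix
matching on the Disallow: lines); the function name and calling
convention are invented, of course:

#include <stdio.h>
#include <string.h>

/* RES 1.0 check: a path is disallowed if it starts with any Disallow:
 * prefix; an empty Disallow: value allows everything.  The function
 * name is made up -- no such library exists yet. */
static int res_allowed(const char *path,
                       const char *disallow[], int ndisallow)
{
    int i;

    for (i = 0; i < ndisallow; i++) {
        size_t len = strlen(disallow[i]);
        if (len > 0 && strncmp(path, disallow[i], len) == 0)
            return 0;   /* matched a Disallow: prefix -> forbidden */
    }
    return 1;           /* nothing matched -> allowed */
}

int main(void)
{
    /* Disallow: lines already collected from this robot's record. */
    const char *disallow[] = { "/cgi-bin/", "/tmp/" };

    printf("%d\n", res_allowed("/cgi-bin/search", disallow, 2));     /* 0 */
    printf("%d\n", res_allowed("/docs/norobots.html", disallow, 2)); /* 1 */
    return 0;
}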

> Unless the threat of not allowing the robot access at all is good enough.

We might add that a number of WWW services do not respond to clients
that are detected to be ill-behaved robots, or that complaints can be
sent to the sysadmin of a site where someone is running a robot, or...

Hmm, maybe the RES should be just one chapter of a larger document: "WWW
Robots and Spiders: Design Rules and Pitfalls". Or that document could
be a companion to RES, intended to be noticed by robot authors, while
RES would be intended to be noticed by WWW authors.

> So where does 'Robot-version:' go then? Within a rule set (like I
> have) or at the top of the file (before any rule sets)?

Oops, I didn't notice that. Sorry; I was thinking of a robots.txt-wide
version. I've got to think about that one a bit.

>> SPC> what happens when you want to extend the functionality of an
>> SPC> existing directive (like 'Disallow:' - see above)?
>>
>> Like MK said, we can't. If we want new functionality, we must put it in
>> new fields, like your (Non-)?Retrievable: fields.
>>
> Now, the problem with this is multiple fields that may have similar
> functions (Disallow: and Non-retrievable: for example) but are different
> in implementation, which makes for more confusion still.

It's still the best we can do; too bad you don't like it:-) We also need
different field names for explicit and regexp matching. Otherwise the
robot won't know whether the "." in "foo.bar" is a dot or an `any-char'.

A better approach might be to use headers with *options*. Example:

Hide:      /Foobar       # explicit
Hide;R:    /Foo.*barbaz  # regexp
Show;R;I:  /foo.*bar     # regexp; ignore case
Show;3-:   /baz.*        # only for robot versions 3 and later

That simplifies things quite a bit; 2 keywords + 2 options can express
the same as 8 keywords.
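
Parsing it is cheap, too. Here is a rough C sketch of how a robot might
split such a field into keyword, options and value; the struct and
function names are made up for illustration, and trailing `#' comments
are stripped just as in the current robots.txt format:

#include <stdio.h>
#include <string.h>

struct field {
    char keyword[32];    /* e.g. "Hide" or "Show"            */
    char options[8][8];  /* e.g. "R", "I", "3-"              */
    int  nopts;
    char value[256];     /* the path or regexp after the ':' */
};

/* Parse one "Keyword;opt;...: value  # comment" line.
 * Returns 0 on success, -1 if the line is not a field line. */
static int parse_field(const char *line, struct field *f)
{
    const char *colon = strchr(line, ':');
    char head[128];
    char *tok, *hash;
    size_t len;

    if (colon == NULL)
        return -1;

    /* Split the part before ':' on ';': the keyword comes first,
     * anything after it is an option. */
    len = (size_t)(colon - line);
    if (len >= sizeof head)
        return -1;
    memcpy(head, line, len);
    head[len] = '\0';

    tok = strtok(head, ";");
    if (tok == NULL)
        return -1;
    snprintf(f->keyword, sizeof f->keyword, "%s", tok);
    f->nopts = 0;
    while ((tok = strtok(NULL, ";")) != NULL && f->nopts < 8)
        snprintf(f->options[f->nopts++], sizeof f->options[0], "%s", tok);

    /* The value is the rest of the line, minus leading blanks and
     * any trailing '#' comment. */
    for (colon++; *colon == ' ' || *colon == '\t'; colon++)
        ;
    snprintf(f->value, sizeof f->value, "%s", colon);
    if ((hash = strchr(f->value, '#')) != NULL)
        *hash = '\0';
    for (len = strlen(f->value);
         len > 0 && (f->value[len-1] == ' ' || f->value[len-1] == '\t');
         len--)
        f->value[len-1] = '\0';
    return 0;
}

int main(void)
{
    struct field f;
    int i;

    if (parse_field("Show;R;I: /foo.*bar    # regexp; ignore case", &f) == 0) {
        printf("keyword=%s value=\"%s\"\n", f.keyword, f.value);
        for (i = 0; i < f.nopts; i++)
            printf("option: %s\n", f.options[i]);
    }
    return 0;
}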

Of course we still need the `Disallow:' lines; WWW authors must write
robots.txt so that old robots will behave properly, or at least as well
as it's possible to make them. We might recommend that from <some time
after RES version 2 is released>, people insert `Disallow: /' in order
to encourage old robots to upgrade.
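
For instance, something like this, assuming version-2 robots give the
new fields precedence over the plain `Disallow:' lines (the field names
are the ones from the sketch above, and the paths are just examples):

User-agent: *
Disallow:  /              # old robots: fetch nothing at all
Show:      /docs/         # version-2 robots: the docs tree is fine
Hide;R:    /docs/.*\.old  # ...except outdated copies

Old robots see only the `Disallow: /' line and stay away completely;
that is the nudge to upgrade.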

The ";fromVersion-toVersion" option would also ensure we won't need new
keywords the next time we want different meanings for old keywords.
Still, I'm not sure I like that idea; a record-wide Robot-Version: might
be better. Or a file-wide one which affects all records down to the
next Robot-Version: line. Or something. Or nothing:-)

>> No, you design the new standard so it does not conflict with the old one.
>
> For some reason, I keep thinking of the IBM PC and Windows, but I
> don't know why ...

OK, we *try* to design the new standard that way:-)
So far at least, that's quite possible.

>> SPC> I would say that adding a rule set for a new
>> SPC> robot where you do not know if it supports the new standard, you
>> SPC> should use the 1.0.0 standard unless you find out otherwise.
>>
>> No. Then a robots.txt's default rule set (User-agent: *) would have to
>> assume the robot does not understand version 2. This makes version 2
>> pretty useless.
>
> You have to assume that regardless of what the 2.0.0 spec looks like.

No no no -- it's true that webmasters should write their robots.txt so
it works for old robots, but they may also add rules that make new
robots behave *better* than old ones. That would be impossible if new
rules interfered with old ones -- then the default rule set could not
contain new rules at all.
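
Concretely (and again assuming version 2 lets a more specific `Show:'
line override a `Disallow:'), even the default record can carry the
extra rules, since old robots simply skip fields they do not recognize:

User-agent: *
Disallow:  /cgi-bin/        # understood by every robot
Disallow:  /drafts/         # old robots skip the whole tree...
Show:      /drafts/public/  # ...version-2 robots may fetch the public part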

Regards,

Hallvard