Re: An extended version of the Robots...

Captain Napalm (spc@armigeron.com)
Mon, 11 Nov 1996 16:07:32 -0500 (EST)


It was thus said that the Great Hallvard B Furuseth once stated:
>
> MK === Martijn Koster wrote:
>
> MK> this is too complex.
> MK> (...)
> MK> - Visit-time
> MK> - Request-rate
> MK>
> MK> (...) would be very much low on my list of priorities to implement,
>
> Divide the fields in obligatory "must implement" and advisory "may
> implement" fields. Visit-time and Request-rate would be advisory.

But if both Visit-time: and Request-rate: are advisory, what incentive do
the various search engines have to implement them, since they limit when
and how fast the robot can work?

Unless the threat of not allowing the robot access at all is incentive enough.
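Just so we have something concrete to point at (I'm making the exact syntax
up here, purely as an illustration), an advisory section might look like:

  User-agent: *
  Visit-time: 0200-0500    # please only visit between 02:00 and 05:00 UT
  Request-rate: 1/30       # no more than one request every 30 seconds

A polite robot could honor those; one that doesn't still sees a valid 1.0.0
file, since unknown fields are ignored.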

> - The version field:
>
> MK> I don't think we need a version field, at least not yet;
>
> Quite so. Not *yet*. But we may regret version 2's lack of version
> field when we are trying to define version 5. I vote to add the field
> now, and either let it be purely informational (so far), or -- in order
> to enforce its use -- let a versionless robots.txt mean version 1.
>
So where does 'Robot-version:' go then? Within a rule set (like I have)
or at the top of the file (before any rule sets)? And I did state that if
'Robot-version:' is not found (within a rule set) then the robot should
treat the rule set as following the 1.0.0 standard.
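To make the two options concrete (only to illustrate the question, not a
proposal for final syntax):

  # Option 1: version inside each rule set
  User-agent: foo
  Robot-version: 2.0.0
  Disallow: /tmp

  # Option 2: version once, at the top of the file
  Robot-version: 2.0.0

  User-agent: foo
  Disallow: /tmp

Either way, a rule set without 'Robot-version:' gets treated as 1.0.0.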

But if we don't allow any directives to be extended with new meaning,
then what use is a version directive, other than being purely
informational?

I'd still ask that it be defined, if only because at some point a version
identification may be needed (it was added to MS-DOS 2.0, the 386, etc.).

> SPC === Captain Napalm (don't you have a real name?) wrote:

I do have a real name, but the system I use for mail doesn't allow enough
space for "Sean 'Captain Napalm' Conner". Sigh.

> SPC> Since 'Disallow:' cannot contain regexes (if you are concerned
> SPC> with compatibility), 'Allow:' can't be used either, since
> SPC> using a new format for 'Allow:' might give the impression that
> SPC> 'Disallow:' allows the same semantics.
>
> Good point. Let's ban the field name `Allow:'.
>
>
> MK> This then means we can only add new fields, with new semantics for
> MK> them.
>
> SPC> what happens when you want to extend the functionality of an
> SPC> existing directive (like 'Disallow:' - see above)?
>
> Like MK said, we can't. If we want new functionality, we must put it in
> new fields, like your (Non-)?Retrievable: fields.
>
Now, the problem with this is that we end up with multiple fields that have
similar functions (Disallow: and Non-retrievable:, for example) but differ
in implementation, which makes for even more confusion.
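For example (and I'm making up the 'Non-retrievable:' pattern syntax here,
purely to show the problem), a site maintainer could end up writing:

  User-agent: foo
  Disallow: /private               # 1.0.0 style: plain path prefix
  Non-retrievable: /private/*.cgi  # 2.0.0 style: pattern

Two fields that both mean "keep out", but with different matching rules.
That's exactly the sort of thing that will get misconfigured.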

> MK> Ah. Found a backwards compatibility problem: You say:
> MK>
> SPC> You either have them, or say to hell with backwards compatibility
> SPC> and break a lot of existing products.
>
> No, you design the new standard so it does not conflict with the old one.
>
For some reason, I keep thinking of the IBM PC and Windows, but I don't
know why ...

Maybe because we're still stuck with 15-year-old technology (and it wasn't
that great when it came out) due to 'standards'.

> MK> a 1.0.0 robot, on seeing:
> MK>
> MK> User-agent: foo
> MK> Version: 2.0.0
> MK> Allow: /foo
> MK>
> MK> will say "/bar is not disallowed, thus approved", whereas a 2.0.0 robot
> MK> would say "No disallow, no 'Allow: /bar', thus denied". Hmmm, don't
> MK> like that...
> SPC>
> SPC> Good point. But I would say that when adding a rule set for a
> SPC> new robot, where you do not know if it supports the new standard,
> SPC> you should use the 1.0.0 standard unless you find out otherwise.
>
> No. Then a robots.txt's default rule set (User-agent: *) would have to
> assume the robot does not understand version 2. This makes version 2
> pretty useless.

You have to assume that regardless of what the 2.0.0 spec looks like.
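Which means the default rule set pretty much has to be written so that a
1.0.0 robot still gets a sensible policy. Something like (just a sketch):

  User-agent: *
  Disallow: /cgi-bin     # 1.0.0 robots understand these
  Disallow: /private
  # any new 2.0.0 fields go here; 1.0.0 robots simply ignore them

Anything you actually need kept out still has to appear in plain
'Disallow:' lines; the new fields can only refine things on top of that.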

> SPC> Or we create a secondary file (say, robots2.txt) and have
> SPC> completely new semantics.
>
> Yes, that's another way. It provides the same functionality as new
> field names (which old robots will ignore) in version 2 of the standard.
> (Imagine that "Disallow:" is translated to "robots.txt-Disallow:" or
> "robots2.txt-Disallow:" depending on the file name.)
>
> I prefer a single robots.txt; fetching just one file requires fewer HTTP
> requests.
>
I too prefer a single file, but for other reasons.

> - precedence of general vs explicit vs regexp
>
> MK> It seems to be a lot more complicated than for example the "evaluate in
> MK> order" strategy I proposed initially:
>
Okay, I misunderstood what he meant by "evaluate in order". This is a
good idea and much simpler than what I had proposed.

> > What if the url doesn't match any rules? Say I have:
>
> Define that the fields for a User-Agent are followed by an implicit
> 'Allow: *', just like in version 1.
>
Okay.
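So, under "evaluate in order", a rule set like (sketch):

  User-agent: foo
  Allow: /docs/public
  Disallow: /docs
  # implicit 'Allow: *' at the end

is read top to bottom and the first match wins: /docs/public/faq.html hits
the 'Allow:' and is approved, /docs/internal.html hits the 'Disallow:' and
is skipped, and a URL that matches nothing falls through to the implicit
'Allow: *'. I can live with that.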

-spc (Will be making updates then ... )