Re: RFC, draft 1

Martijn Koster (m.koster@webcrawler.com)
Tue, 19 Nov 1996 11:29:52 -0800


At 5:19 PM 11/18/96, Hallvard B Furuseth wrote:

>Nice work!

Thanks, and thanks for your review.

>A formal spec is good but I don't think there is anything wrong with a
>separate users's guide chapter in the same RFC.

I strongly disagree. The spec should be self-supporting, unambiguous,
and the final arbiter. A user's guide would only introduce fuzziness
that distracts from that.

More importantly though, the User Guide is too large and dynamic:
every time a new server is released whose configuration works differently,
it needs to be updated. It needs to be translated into different languages.
As mailing list locations change, it needs to be updated. As Darren's
authoring tool gets done, it needs to be updated. All of this makes it
impossible to put into an RFC.

I don't think this is a problem: collectively we can spread the word about
where this user guide lives. Maybe we should even relocate it to w3.org.
Anyway, let's address that when the guide has been written.

>Alternatively, it should *refer to* the user's guide.

The reverse should certainly be true.

>
>> Comments are allowed anywhere in the file, and consist of a comment
>> character '#'
>
>...at the beginning of the line or preceded by whitespace?

Consult the BNF :-) Point taken though.

>A little vague. If several User-Agent lines match, which one to use? I
>assume the first, except that User-Agent * is considered last even if it
>occurs first.

The wording is a little vague because the concept is a little vague :-)

The way I thought robots might want this to work is to have a list
of names, in order of precedence. Say a robot is called "foo",
and later improved and renamed "bar". It could search for "bar"
User-agent lines first, and failing that, try "foo" User-agent lines
instead. Vague, as I said.
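A minimal sketch of that idea (the function name, record shape, and robot names are mine, purely for illustration):

```python
def select_record(records, robot_names):
    """Pick the first matching record for a robot known under several
    names, tried in order of precedence (e.g. the newest name first).
    records is a list of (user_agent_value, record) pairs."""
    for name in robot_names:
        for ua, record in records:
            if ua.lower() == name.lower():
                return record
    return None  # no record for any of the robot's names

# A robot once called "foo", renamed "bar", prefers "bar" records:
records = [("foo", "old rules"), ("bar", "new rules")]
print(select_record(records, ["bar", "foo"]))  # → new rules
```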

>Suggestion (though I hope someone can make it into better English):
>
>"The robot should send its name in a User-Agent: header over protocols
>that allow such a header, e.g. HTTP/1.0. The robot must obey the first
>record in /robots.txt whose User-Agent value is a substring of the
>robot's User-Agent: header contents. If no such record exists, it
>should obey the `User-Agent: *' record, if present. Otherwise, access
>is unlimited.
>
>The name comparisons are case-insensitive."

Different, but less vague, and undoubtedly better.
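For concreteness, a sketch of the record-selection rule as suggested above (substring match, case-insensitive, `*` as fallback; names and record shapes are mine):

```python
def find_record(records, ua_header):
    """Obey the first record whose User-Agent value is a substring of
    the robot's User-Agent header (case-insensitive); fall back to the
    "*" record; if neither exists, access is unlimited."""
    header = ua_header.lower()
    for ua, record in records:
        if ua != "*" and ua.lower() in header:
            return record
    for ua, record in records:
        if ua == "*":
            return record
    return None  # no applicable record: access is unlimited

rules = [("webcrawler", ["/tmp/"]), ("*", ["/"])]
print(find_record(rules, "WebCrawler/3.0 (robot)"))  # → ['/tmp/']
print(find_record(rules, "SomeOtherBot/1.0"))        # → ['/']
```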

>Good point. However, remember there are other special chars as well.
>RFC 1738, chapter 2.2, says:
> The characters ";", "/", "?", ":", "@", "=" and "&" are the
> characters which may be reserved for special meaning within a scheme.
> No other characters may be reserved within a scheme.

I know, but note that these are only reserved because they delimit
components of the URL. In /robots.txt you can only have a path,
and therefore they don't _need_ to be reserved. This means that if
a path in a /robots.txt is "illegal" in that someone uses "~" instead of
%7E, the matching algorithm will fix this on the fly. It doesn't impact
valid paths, and disambiguates the handling of invalid paths (which, in
the case of "~", occur often).

Or that's my thinking anyway.
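A sketch of that on-the-fly fix (the helper name and details are mine, not the draft's): decode every %xx escape when normalizing a path for comparison, except an encoded "/", which has to stay escaped to keep its meaning:

```python
def normalize(path):
    """Decode %xx escapes in a path, except %2F, which must stay
    encoded so an escaped "/" inside a segment is not confused with
    a path delimiter. Also normalizes the hex case of kept escapes."""
    out, i = [], 0
    while i < len(path):
        if path[i] == "%" and i + 2 < len(path):
            try:
                char = chr(int(path[i+1:i+3], 16))
            except ValueError:
                out.append(path[i]); i += 1; continue  # stray "%"
            if char == "/":
                out.append("%" + path[i+1:i+3].upper())  # keep encoded
            else:
                out.append(char)  # e.g. %7E -> "~"
            i += 3
        else:
            out.append(path[i]); i += 1
    return "".join(out)

print(normalize("/%7Ejoe/index.html"))  # → /~joe/index.html
print(normalize("/a%2fb"))              # → /a%2Fb
```

With this, "/~joe/" in a URL and "/%7Ejoe/" in a /robots.txt record normalize to the same string before matching.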

>An HTTP server may treat "/" and ";" and their encoded forms the same way.
>Ours does. OTOH, a WWW gateway may treat the reserved chars differently
>when encoded.
>
>I can think of 4 ways to deal with this:
> 1. For simplicity, decode all %xx sequences, or

Absolutely not; that breaks the URL spec's required handling of '/'.

> 2. Don't decode any of the reserved chars, or

That has bitten me in the past, specifically in the ~ case, and in
the case-insensitive nature of the %xx hex digits.

> 3. A scheme-dependent selection of which chars should not be decoded.
> (According to RFC 1738:
> HTTP -- "/", ";", and "?".
> Gopher -- no chars
> FTP -- "/" and ";"), or
> 4. Decode any but those that would probably be written unencoded if
> they were to have the unencoded meaning.
> (Yes, maybe "/" only is a good choice here. And "?", which
> when decoded would start the query part.)

That's my tack. Except query parts are not allowed in /robots.txt,
so '?' need not be reserved.
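A minimal prefix matcher along those lines, using the standard library's decoder (function names are mine; a sketch of option 4, not the draft's wording): decode everything except "/" on both sides before comparing.

```python
from urllib.parse import unquote

def decode_path(path):
    """Decode %xx escapes except %2F, so a "/" encoded inside a
    segment is not confused with a path delimiter."""
    # Normalize the protected escape's hex case, then decode around it.
    parts = path.replace("%2f", "%2F").split("%2F")
    return "%2F".join(unquote(part) for part in parts)

def disallowed(rule_path, request_path):
    """Prefix match after decoding both sides the same way."""
    return decode_path(request_path).startswith(decode_path(rule_path))

print(disallowed("/%7Ejoe/", "/~joe/private.html"))  # → True
print(disallowed("/a%2Fb", "/a/b"))                  # → False
```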

> 5. The chars mentioned in the /robots.txt line "Reserved-Chars:"
> should not be decoded (and default is one of the options above).

I don't think I like that much...

>it probably makes sense to add (5) when regexps are added but not before.

I don't want to get into this too much, but the way I thought of handling
that is not to use the same BNF for DisallowRegexp values as for Allow
values, and in the DisallowRegexp case to require that special characters
be encoded if they are not to carry regexp semantics. This may turn
out to make the common case look obvious, while unambiguously allowing
exceptions. Anyway, I'm not done considering that yet, so don't consider
this a proposal :-)

OK, back to the text editor now....

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html