Martijn Koster wrote:
> My focus is not on new features (although I did add Allow),
> but on providing a more solid specification which addresses concerns of
> ambiguity and completeness.
Good. We'll fight out regexps/globbing & stuff afterwards.
> As a result the spec itself looks complex,
> though the format itself isn't; this is _not_ a user guide, that would
> look completely different, and far simpler.
A formal spec is good but I don't think there is anything wrong with a
separate user's guide chapter in the same RFC.
Alternatively, it should *refer to* the user's guide.
> Comments are allowed anywhere in the file, and consist of a comment
> character '#'
...at the beginning of the line or preceded by whitespace?
> followed by the comment, terminated by the end-of-line.
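To make the question concrete, the two readings differ like this (a
Python sketch of my own; none of these names come from the draft):

    import re

    def strip_comment_anywhere(line):
        # Reading 1: '#' starts a comment wherever it appears.
        return line.split('#', 1)[0].rstrip()

    def strip_comment_after_whitespace(line):
        # Reading 2: '#' starts a comment only at the beginning of the
        # line or when preceded by whitespace.
        return re.sub(r'(^|\s)#.*$', '', line).rstrip()

    # For "Disallow: /tmp#backup":
    #   strip_comment_anywhere(...)         -> "Disallow: /tmp"
    #   strip_comment_after_whitespace(...) -> "Disallow: /tmp#backup"

The spec should say which reading is intended.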
> 3.2.1 The User-agent line
A little vague. If several User-Agent lines match, which one to use? I
assume the first, except that User-Agent * is considered last even if it
occurs first.
Suggestion (though I hope someone can make it into better English):
"The robot should send it's name in a User-Agent: header over protocols
that allow such a header, e.g. HTTP/1.0. The robot must obey the first
record in /robots.txt whose User-Agent value is a substring of the
robot's User-Agent: header contents. If no such record exists, it
should obey the `User-Agent: *' record, if present. Otherwise, access
is unlimited.
The name comparisons are case-insensitive."
3.2.2 The Allow and Disallow lines
> If a %xx encoded octet is
> encountered it is unencoded prior to comparison, unless it is the
> "/" character, which has special meaning in a path.
Good point. However, remember there are other special chars as well.
RFC 1738, chapter 2.2, says:
The characters ";", "/", "?", ":", "@", "=" and "&" are the
characters which may be reserved for special meaning within a scheme.
No other characters may be reserved within a scheme.
An HTTP server may treat "/" and ";" and their encoded forms the same way.
Ours does. OTOH, a WWW gateway may treat the reserved chars differently
when encoded.
I can think of 5 ways to deal with this:
1. For simplicity, decode all %xx sequences, or
2. Don't decode any of the reserved chars, or
3. A scheme-dependent selection of which chars should not be decoded.
(According to RFC 1738:
HTTP -- "/", ";", and "?".
Gopher -- no chars
FTP -- "/" and ";"), or
4. Decode any but those that would probably be written unencoded if
they were to have the unencoded meaning.
   (Yes, maybe "/" alone is a good choice here. And "?", which
   when decoded would start the query part.)
5. The chars mentioned in the /robots.txt line "Reserved-Chars:"
should not be decoded (and default is one of the options above).
I think I vote for (1). Or (4). Or... I really don't know.
Except I don't vote for (5) yet; it would be nice for WWW gateways, but I
guess gateways with strange URLs will have worse problems anyway. It
probably makes sense to add (5) when regexps are added, but not before.
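For concreteness, here is how (1) and (4) might look in code (a Python
sketch of mine; which octets to keep encoded is exactly the open
question):

    from urllib.parse import unquote

    HEX = set('0123456789abcdefABCDEF')

    def decode_all(path):
        # Option (1): decode every %xx sequence before comparison.
        return unquote(path)

    def decode_except(path, keep='/?'):
        # Option (4): decode %xx except octets whose decoded form
        # would take on special meaning, e.g. '/' (%2F) and '?' (%3F).
        out, i = [], 0
        while i < len(path):
            if (path[i] == '%' and i + 2 < len(path)
                    and path[i+1] in HEX and path[i+2] in HEX):
                octet = chr(int(path[i+1:i+3], 16))
                if octet in keep:
                    out.append(path[i:i+3])   # leave it encoded
                else:
                    out.append(octet)
                i += 3
            else:
                out.append(path[i])
                i += 1
        return ''.join(out)

    # Under (1), "/a%2Fb" compares equal to "/a/b";
    # under (4) the two stay distinct.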
Regards,
Hallvard