Re: RFC, draft 1

Martijn Koster (m.koster@webcrawler.com)
Tue, 19 Nov 1996 11:29:33 -0800


At 1:46 PM 11/16/96, Darren Hardy wrote:

lots :-) Thanks for your in-depth comments!

>> The instructions must be accessible via HTTP [2] from the site that
>> the instructions are to be applied to, as a resource of Internet
>> Media Type [3] "text/plain" under a standard relative path on the
>> server: "/robots.txt".
>
>Works with HTTPS too, or is that implied?

I think that is implied, in that HTTPS is just HTTP layered over SSL, and
anything that is true for HTTP is true for HTTPS, except that SSL is used
instead of straight TCP.

However, I have been unable to find a spec on integrating HTTP/SSL/HTML
that might confirm or contradict that assumption. Do you happen to have
any pointers to such a document?

The more general question of what access methods this can be applied to is
interesting to consider as well. As far as I can see, this could work in
any scheme that satisfies these conditions:
- the URL follows the generic URL syntax, with %xx escaping
- only '/' is a special character in the URL path
- the netloc of the URL is the only part that identifies a "site"
- the netloc and path together are sufficient to identify a resource

But I have the feeling there may be many semantics you might miss
if you didn't explicitly mention supported URL schemes and their
corresponding Access Methods. We could just about do FTP, but anything
else I'd be wary of. So for now, let's solve the puzzle for HTTP alone.

>Following redirects really should be required.

Fair enough.

>It would make life for servers which serve out documents for multiple web
>hosts much easier. Big servers are often the ones most sensitive to robots.
>For example, a server which serves out dozens of vanity domains could more
>easily implement /robots.txt per domain using redirection like so:
>
> http://www.vanity1.com/robots.txt -> redirect -> /robots/vanity1.txt
> http://www.vanity2.com/robots.txt -> redirect -> /robots/vanity2.txt
>
>One of the problems with the robots.txt format is that it doesn't address
>servers which serve virtual collections like the vanity names (or software
>virtual servers). Redirection might be the easier way to go, rather than
>modifying the robots.txt format to support collections.

I _really_ don't want special hacks for virtual hosts creeping into the
contents of /robots.txt, and not really into the spec either.

My feeling is that the vanity problem is broken beyond fixing in a simple
format like /robots.txt. In my job I see redirects mess things up a lot.

If a server supports vanity domain names, it should do it properly, and
maintain completely separate URL name spaces depending on either the
interface a request came in on, or the Host header.

>I suppose the alternative is to require that these folks generate robots.txt
>on-the-fly with a CGI or something, but I don't think that's as attractive.

Well, what makes sense really depends on the nature of the server. Simply
demanding that to get the /robots.txt for a given http URL you take the
netloc and append the path "/robots.txt" specifies all you need to know,
and gives servers complete freedom in deciding how to deal with that.
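
Just to make that concrete, here is a minimal sketch (in Python, purely
illustrative) of the derivation a robot would do:

  from urllib.parse import urlsplit, urlunsplit

  def robots_url(url):
      # Keep the scheme and netloc, replace the path with "/robots.txt",
      # and drop query and fragment. How this extends beyond http is the
      # open question discussed above.
      scheme, netloc, _, _, _ = urlsplit(url)
      return urlunsplit((scheme, netloc, "/robots.txt", "", ""))

  # robots_url("http://www.vanity1.com/some/page.html")
  #   -> "http://www.vanity1.com/robots.txt"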

>> Record Path           URL path               Match
>> /a%2fb.html            /a%2fb.html            yes
>> /a%2fb.html            /a%2Fb.html            yes
>> /a%2fb.html            /a/b.html              no
>> /%7ejoe/index.html     /~joe/index.html       yes
>> /~joe/index.html       /%7Ejoe/index.html     yes
>
>This is very helpful. Is the %2f thing valid for URLs?
>Should double check with the URL spec. I think that '/'
>has special meaning and isn't equivalent to %2f. I'm
>not sure that the /a%2fb == /a/b is correct.

You are absolutely right, that's why the "match" column says "no".
Am I missing something here?
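
To spell out the intended rule, here is a rough sketch (Python, not
normative) of a comparison that decodes %xx octets but keeps %2F distinct
from a literal '/':

  import re

  def decode_path(path):
      # Decode %xx escapes, except %2F which must stay distinct from '/'.
      def repl(m):
          octet = m.group(1)
          if octet.lower() == "2f":
              return "%2F"
          return chr(int(octet, 16))
      return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)

  def matches(record_path, url_path):
      # A record path matches if it is a prefix of the URL path,
      # compared octet-wise after decoding.
      return decode_path(url_path).startswith(decode_path(record_path))

  # The examples above:
  #   matches("/a%2fb.html", "/a%2Fb.html")             -> True
  #   matches("/a%2fb.html", "/a/b.html")               -> False
  #   matches("/%7ejoe/index.html", "/~joe/index.html") -> True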
>
>Excellent. The addition of Allow is much needed.
>It makes the format much more useful.

It prevents Disallow: /A, Disallow: /B, etc :-)
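
For example (illustrative only), a single Allow before a blanket Disallow
opens up one tree and closes off the rest, since the first matching rule
applies:

  User-agent: *
  Allow: /public/
  Disallow: /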

>My point is that in the future the format should be able to specify
>Allow/Disallow rules based on information other than the URL.

I should certainly hope it is flexible enough to do that in future.
Note that the current BNF doesn't actually allow for new field/value
pairs (duh!), so I'll fix that :-)

>Content-type
>or Content-Language are useful ones for example. Here's an example of
>what I mean in a hacked syntax:
>
> User-Agent: Some-French-Robot
> Allow[Content-Language]: french
> Disallow[Content-Language]: .*
>
> User-Agent: *
> Disallow[Content-Type]: image/jpeg
> Allow[Content-Type]: *

Hmmm... how much of this is robot administrators wanting to control what
they index, rather than server administrators wanting to control what
robots index on their site?

I think these kinds of things are a slippery slope; what really defines
the content of a URL retrieval is a function of so much confusing stuff
in HTTP that if too much of HTTP negotiation were to move into /robots.txt,
we would probably shoot ourselves in the foot by making interoperability
mistakes. At some point you have to say: if a server wants to refuse
specific retrievals, why doesn't it do so in HTTP?
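
Just to illustrate what I mean by refusing in HTTP, a toy sketch (Python,
purely illustrative, not a proposal; the robot name is the hypothetical one
from your example):

  from http.server import BaseHTTPRequestHandler, HTTPServer

  BLOCKED_AGENTS = ("Some-French-Robot",)

  class Handler(BaseHTTPRequestHandler):
      def do_GET(self):
          agent = self.headers.get("User-Agent", "")
          if any(name in agent for name in BLOCKED_AGENTS):
              # Refuse the retrieval at the HTTP level instead of
              # describing the policy in /robots.txt.
              self.send_error(403, "Forbidden for this robot")
              return
          self.send_response(200)
          self.send_header("Content-Type", "text/plain")
          self.end_headers()
          self.wfile.write(b"hello\n")

  # HTTPServer(("", 8000), Handler).serve_forever()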

>The Harvest Gatherer also allows some more flexible rule definitions
>based on protocols, hosts, etc. Details on that are here:

Thanks for the references, I'll check them out.

>It might be very tough for content providers to use the "standard HTTP
>cache-control" mechanisms to specify Expires headers, for example, since
>robots.txt uses the text/plain type, not HTML. Typically, you would use
>HTML to do this:
> <META HTTP-EQUIV="Expires" CONTENT="blah">
>or whatever. Many Web servers have poor support for expiration in HTTP.

A typical example of a situation where the real problem should be fixed,
not hacked in an application-specific thing like /robots.txt, IMHO.

>So, I'd suggest explicitly adding an Expiration field to the robots.txt
>format, using the HTTP date format, IMHO.

The problem is that the expiration is on the entire file, and the format
wasn't defined to accommodate that (a mistake, in hindsight). So the only
real option is to add Expires to every record, which is somewhat
unattractive [but see below].

The other thing is that it gets messy when you have conflicting directives.
Which do you follow? What incentive is there for robots to implement the
proper HTTP-based cache-control once they "have a solution"?

That's why I haven't put it in. If anyone feels strongly about it one
way or other I'd like to hear it.
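
For the record, honouring the standard HTTP mechanism on the robot side is
not much work either; a sketch (Python, illustrative, no error handling):

  from email.utils import parsedate_to_datetime
  from urllib.request import urlopen

  def fetch_robots_txt(url):
      # Fetch /robots.txt and note when the cached copy should expire,
      # based on a standard HTTP Expires header if the server sends one.
      resp = urlopen(url)
      body = resp.read().decode("latin-1")
      expires = resp.headers.get("Expires")
      expires_at = parsedate_to_datetime(expires) if expires else None
      return body, expires_at   # None: fall back to a robot-chosen default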

>I would like to see version information in the file format which
>identifies the version of the format used. It could be done via a comment
>on the first line like so:
> # robots.txt-Version: 2.0
>or something. Or, through a version tag:
> Version: 2.0
>
>I think that it would help with future backward compatibility problems,
>and for robots.txt which are programmatically generated IMHO.

That has been suggested before, and I'm not in principle against it,
but I wonder if we shouldn't fix it properly once and for all, and introduce
a file-wide header at the start of the file. Whose /robots.txt implementations
would break if a /robots.txt started with:

Robots-Version: 2.0
Expires: ...
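
Skipping such a header on the parsing side is trivial; a sketch (Python,
using the hypothetical field names above):

  def split_header(lines):
      # Treat leading "Robots-Version:" / "Expires:" lines as a file-wide
      # header; everything else is handed to the normal record parser.
      header, rest = {}, []
      in_header = True
      for line in lines:
          if in_header and line.lower().startswith(("robots-version:",
                                                    "expires:")):
              name, _, value = line.partition(":")
              header[name.strip().lower()] = value.strip()
          else:
              in_header = False
              rest.append(line)
      return header, rest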

>> User-Agent: ...

>> As the majority of /robots.txt files are created with platform-
>> specific text editors, robots should be liberal in accepting files
>> with different end-of-line conventions, specifically CR and LF in
>> addition to CRLF.
>
>This must change to encourage any kind of real deployment of robots.txt
>on servers.

Yes and no. If we had required authoring support, it would never have happened.
But I agree automatic generation is preferable, and post-generation checking
(lint) would be better than nothing.

But if we are pragmatic, the text above must stay in, for backwards
compatibility with the installed base.
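
(On the robot side, being liberal is cheap anyway; in Python, for instance,
splitlines() already accepts CR, LF and CRLF:)

  robots_txt_text = "User-agent: *\rDisallow: /tmp/\r\nDisallow: /private/\n"
  lines = [line.strip() for line in robots_txt_text.splitlines()
           if line.strip()]
  # -> ['User-agent: *', 'Disallow: /tmp/', 'Disallow: /private/']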

>robots.txt must be programmatically generated by Web server
>administration tools or something.

Sure.

Better yet, web server admin tools should allow people to specify all
sorts of things about their URLs, such as negotiation-related stuff and
even User-Agent filtering; then a whole lot of other stuff could be
solved too :-)

>The format should encourage such things.

The spec says you must adhere to a concrete BNF, so if you don't,
you don't comply. You can't get much more encouraging than that in a spec...

>Perhaps we could put some robots.txt authoring code to help
>the admin tools guys...

This sounds like a useful (separate!) thread. Personally I'm not so sure
what it would look like.

>Using certificates to verify the authenticity of a /robots.txt sounds
>very valuable.

Yes.

>Perhaps mention the channels that Web site administrators can use to
>enforce /robots.txt honoring. Things like checking the 'From' field
>to get an email address for the robot runner, the 'User-Agent' field
>to get the identity of the robot runner, the 'Referer' field to find
>out more information about that robot. Maybe even the robots mailing
>list to report out-of-control robots, etc.

That stuff firmly belongs in the User Guide to Robots Exclusion,
which is on my list right after the RFC :-)

Thanks again for your comments

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html