Re: robots.txt extensions

Martijn Koster (m.koster@webcrawler.com)
Wed, 10 Jan 1996 14:05:30 -0700


At 10:35 AM 1/10/96, Adam Jack wrote:
>Hello,
>
>Since this list started I've only ever seen one suggestion
>for an extension to robots.txt.

An extension discussion document sounds like an ideal, though belated,
New Year's resolution :-)

>to add expiry information for the
>robots.txt file itself. No response appears to have been given
>-- did people not think it worthwhile? Did people think the
>HTTP response field, Expires, should be used for that?

Yes, and I also don't think it's something widely wanted, and
that it will be confusing to people, who don't understand
all the ins and outs anyway. (How about a separate 'funny messages
in /robots.txt' thread? :-)

The thing about Expires is that it is a prediction, and people are
not good at making predictions; they want a "I changed it, now
update all your robots out there" push scheme.
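
Not that honouring Expires would be hard; just to make it concrete,
here is a rough Python sketch of an Expires-driven cache for
/robots.txt. The parsing, the one-week fallback and all the names are
only illustrative, nothing specified anywhere:

  # Sketch: cache /robots.txt per host and re-fetch only once the
  # server-supplied Expires time has passed.
  import time
  import urllib.request
  from email.utils import parsedate_to_datetime

  _cache = {}   # host -> (body, expiry timestamp)

  def get_robots_txt(host):
      body, expires = _cache.get(host, (None, 0))
      if body is not None and time.time() < expires:
          return body                    # still fresh, no request made
      with urllib.request.urlopen("http://%s/robots.txt" % host) as resp:
          body = resp.read().decode("latin-1", "replace")
          header = resp.headers.get("Expires")
          try:
              expires = parsedate_to_datetime(header).timestamp()
          except (TypeError, ValueError):
              expires = time.time() + 7 * 24 * 3600   # no/bad header: a week
      _cache[host] = (body, expires)
      return body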

Does submitting a '/robots.txt' manually to robots bump it up in
the queue (it does in WebCrawler)? Then you could use submit-it to
do the push :-)

>I don't know if this was discussed to death somewhere -- but
>are people still considering extensions to robots.txt? I'd be
>interested in any pointers to an archive of such a discussion.

My thoughts never made it to the list :-)

>If there is a point in discussing additions pls read on --
>otherwise bin this mail.

No, by all means go ahead. But most of all I want to keep things simple.

>MinRequestInterval: X
>
> Minimum request interval in seconds, (0=no minimum),
> with a default, if missing, of 60.
>
> This is for those of us lowly enough not to have huge
> gathering tasks and the luxury ;-) of a backlog of URLs
> over distributed sites. (I.e. Those of us doing a
> sequential search exhausting our interest in a site in
> one slurp.) Additionally local admins would have more
> control over wanderers that visited.

Interesting, I didn't think people still did that :-)
I think 60 is a sensible default, so let's think about why you would
change it from that. There seems little point in setting it much higher,
because even on the worst platform one request per minute is no problem
(unless previous connections are still open). But who would set it
much lower? Only someone who wants to run a robot to their own site,
in which case they can control the speed themselves...

So is it worth doing it at all?
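
If it is, honouring it is at least cheap. Here is a rough Python
sketch using Adam's proposed field name and the 60-second default;
none of this is in the current spec:

  # Sketch: parse the proposed MinRequestInterval field and throttle
  # requests to a host accordingly.
  import time

  def min_request_interval(robots_txt):
      for line in robots_txt.splitlines():
          field, _, value = line.partition(":")
          if field.strip().lower() == "minrequestinterval":
              try:
                  return max(0, int(value.strip()))
              except ValueError:
                  pass
      return 60   # proposed default when the field is missing

  _last_visit = {}   # host -> time of the previous request

  def wait_before_request(host, robots_txt):
      interval = min_request_interval(robots_txt)
      elapsed = time.time() - _last_visit.get(host, 0)
      if elapsed < interval:
          time.sleep(interval - elapsed)
      _last_visit[host] = time.time()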

>DefaultIndex: index.html
>
> Stating that XXXX/ and XXXX/index.html are identical.
>
> You can argue that this is lamely inadequate - or that it
> makes a saving. I know the bigger issue is recursion. Here
> I am merely hoping to save those single page recursions.

Yes, I do argue that this is lamely inadequate; I too think checksums
are the way for this, even if it is post-retrieval; pre-retrieval is
always a guess (even if we could have an If-not-md5 HTTP header).
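
Something along these lines after retrieval would do; MD5 is only one
candidate digest, and the names here are made up for illustration:

  # Sketch: post-retrieval duplicate detection, so that XXXX/ and
  # XXXX/index.html collapse to one document once both have been fetched.
  import hashlib

  _seen = {}   # content digest -> first URL fetched with that content

  def is_duplicate(url, body):
      # body is the raw document bytes
      digest = hashlib.md5(body).hexdigest()
      if digest in _seen:
          return True, _seen[digest]   # already indexed under another URL
      _seen[digest] = url
      return False, None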

>CGIMask: *.cgi
>
> Rather than guessing at CGI urls -- why not get the local
> admin to answer it? I know that the WN server uses a file
> extension to indicate a CGI script -- not /cgi-bin/.
>
> Q: Are CGI scripts universally avoided in advance -- or do
> robots look at the HTTP flags of results to try to work
> out whether some content is dynamically generated?

I always think you shouldn't make a distinction between dynamically
generated output and static output. What you should pay attention to
is things like Expires, forms and queries, and outrageous recursion...
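
In other words, put those checks in the general URL filter rather than
in a special CGI rule. A rough sketch of what I mean, with a depth
limit and loop heuristic pulled out of thin air:

  # Sketch: filter URLs on queries and suspiciously deep or repetitive
  # paths, instead of guessing whether the output is dynamically generated.
  from urllib.parse import urlsplit

  MAX_DEPTH = 8   # arbitrary guard against runaway recursion

  def should_fetch(url):
      parts = urlsplit(url)
      if parts.query:
          return False                     # query URLs: skip
      segments = [s for s in parts.path.split("/") if s]
      if len(segments) > MAX_DEPTH:
          return False                     # outrageously deep path
      if len(segments) != len(set(segments)):
          return False                     # repeated segment, likely a loop
      return True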

>Finally -- I never understood why robots.txt was exclusion only.
>Why does it not have some positive hints added? I.e. you are
>allowed & welcome to browse XXXX/fred.html. Was this a choice
>built upon pragmatism -- thinking that this would open a can of
>worms?

Ha, finally someone who understands me! :-))

Yes, the can is really opened up when you start allowing keywords and
stuff.

I did think maybe one or both of a 'Visit' and a 'Meta' header would
be a reasonable idea:

'Visit' would allow URLs to be listed for retrieval, and nothing more.
So you could do:

| Disallow: /
| Visit: /welcome.html
| Visit: /products.html
| Visit: /keywords-and-overview-for-robots.html

Which would be kinda cool and simple, but doesn't scale well to many
URLs, or to more meta data.
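
Parsing it would be no harder than what we do now. Roughly, and
remember 'Visit' is only the idea above, not part of any spec:

  # Sketch: parse a record with Disallow and the proposed Visit field.
  # Visit URLs are retrieved even when a broad Disallow would exclude them.
  def parse_record(robots_txt):
      disallow, visit = [], []
      for line in robots_txt.splitlines():
          line = line.split("#", 1)[0].strip()
          if not line:
              continue
          field, _, value = line.partition(":")
          field, value = field.strip().lower(), value.strip()
          if field == "disallow":
              disallow.append(value)
          elif field == "visit":
              visit.append(value)
      return disallow, visit

  def allowed(path, disallow, visit):
      if path in visit:
          return True                      # explicitly invited
      return not any(d and path.startswith(d) for d in disallow)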

'Meta' would specify a link to a separate document, using some TBD
format (or formats, using content negotiation to pick text/url-list,
text/aliweb, urc/foo or whatever) which further guides content
selection and meta data for a site.

Other requests I've had are regular expression support (e.g. for
Disallow: *.html3), and allowing multiple paths per disallow line.
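
Shell-style wildcards would probably cover most of the regular
expression requests; a rough sketch of both ideas, where the matching
rules are just my guess at what people are after:

  # Sketch: wildcard matching for a rule like "Disallow: *.html3",
  # plus several space-separated paths on one Disallow line.
  from fnmatch import fnmatch

  def disallowed(path, disallow_lines):
      for line in disallow_lines:
          for pattern in line.split():          # multiple paths per line
              if "*" in pattern or "?" in pattern:
                  if fnmatch(path, pattern):
                      return True
              elif path.startswith(pattern):    # plain prefix, as today
                  return True
      return False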

What do people think of the above? Any others?

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html