Re: robots.txt extensions

Jaakko Hyvatti (Jaakko.Hyvatti@www.fi)
Thu, 11 Jan 1996 02:38:12 +0200 (EET)


Adam:
> If there is a point in discussing additions pls read on --
> otherwise bin this mail.

Sure is. In the following I will comment on your proposals from the
viewpoint 'Does it solve any problems? Will anyone implement it?',
because I think no one will implement any extensions if there is
nothing to gain and, more specifically, no problems to solve with them.

> http://info.webcrawler.com/mailing-lists/robots/0001.html
>
> seemed sensible enough -- to add expiry information for the
> robots.txt file itself. No response appears to have been given

The problem is: someone changes robots.txt while the cached copy is
still trusted by a robot. Adding expiry info does not stop sysadmins
from editing robots.txt before its expiration, so it still has to be
retrieved at some sensible interval before expiration if the expiry is
set too far in the future.

Retrieving robots.txt every 100th - 1000th GET, or at minimum every 8
hours and at most every couple of days, will not increase net traffic
and solves the problem better than expiry fields. And because every
robot has to handle robots.txt expiration sensibly anyway, no sysadmin
will see this as a problem, and so no one will implement the new field.
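
For what it is worth, the refresh policy above could look roughly like
this in a robot. This is only a sketch: the class name and the exact
thresholds are my own assumptions, not part of any proposal.

import time

# Illustrative thresholds only: refetch after 1000 GETs, but never more
# often than every 8 hours and never less often than every 2 days.
REFRESH_AFTER_GETS = 1000
MIN_AGE = 8 * 3600
MAX_AGE = 2 * 24 * 3600

class RobotsCache:
    def __init__(self):
        self.fetched_at = 0.0
        self.gets_since_fetch = 0

    def note_get(self):
        self.gets_since_fetch += 1

    def needs_refresh(self, now=None):
        now = time.time() if now is None else now
        age = now - self.fetched_at
        if age < MIN_AGE:
            return False
        if age > MAX_AGE:
            return True
        return self.gets_since_fetch >= REFRESH_AFTER_GETS

    def mark_fetched(self, now=None):
        self.fetched_at = time.time() if now is None else now
        self.gets_since_fetch = 0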

> MinRequestInterval: X
>
> Minimum request interval in seconds, (0=no minimum),
> with a default, if missing, of 60.

There is no problem with request intervals for well-behaved robots, and
will ill-behaved ones obey this field anyway? So there is no problem,
and it does not even get solved :-) Again, nobody will implement this.
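
If a robot did want to honor such a field, it would only have to sleep
between requests to the same host. A trivial sketch, with the field
semantics (seconds, default 60) taken from the quoted proposal:

import time

last_request = {}   # host -> time of previous request to that host

def wait_for_host(host, min_interval=60):
    # min_interval would come from MinRequestInterval, default 60
    now = time.time()
    earliest = last_request.get(host, 0) + min_interval
    if now < earliest:
        time.sleep(earliest - now)
    last_request[host] = time.time()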

> DefaultIndex: index.html
>
> Stating that XXXX/ and XXXX/index.html are identical.

Checksums are easier and have to be implemented anyway, because most
sites will not have this field. And because checksums work, this field
is unnecessary and no one will use it.
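
Just to illustrate what I mean by checksums: a robot can keep a digest
of every page it has already indexed and skip URLs whose bodies hash to
a known value. A rough sketch; the function name is made up and MD5 is
only one possible digest:

import hashlib

seen = {}   # digest -> first URL seen with that body

def is_duplicate(url, body):
    # body is the fetched page as bytes; /dir/ and /dir/index.html
    # produce the same digest and so are detected as duplicates
    digest = hashlib.md5(body).hexdigest()
    if digest in seen:
        return True
    seen[digest] = url
    return False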

> CGIMask: *.cgi

Hmm. Disallow: with regular expressions would be more generic.
But again: how many cases can be found where this is necessary?
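
If someone did want it, applying such regular-expression Disallow lines
on the robot side is simple enough. Note that this syntax is purely
hypothetical and not part of the current robots.txt standard:

import re

# Hypothetical regular-expression exclusion rules
disallow_patterns = [
    re.compile(r"\.cgi$"),   # the CGIMask example expressed as a regex
    re.compile(r"^/tmp/"),
]

def allowed(path):
    return not any(p.search(path) for p in disallow_patterns)

# allowed("/cgi-bin/search.cgi") -> False
# allowed("/docs/index.html")    -> True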

> Finally -- I never understood why robots.txt was exclusion only.
> Why does it not have some positive hints added? I.e. you are
> allowed & welcome to browse XXXX/fred.html. Was this a choice
> built upon pragmatism -- thinking that this would open a can of
> worms?

I do not believe it is a problem to give robots URLs; they are
pretty good at finding them themselves. Also, listing a URL in
robots.txt does not bring the robot for a visit - a submission
to the robot's admin will.

On the other hand, the lack of a way to exclude robots from sites/URLs
was a severe problem, and robots.txt solved it well.

Also, while updating the information content of a site, sysadmins
and ordinary users will surely forget to update robots.txt.
(Directories are more static, and therefore the current scheme works.)

I am sorry I sound quite negative.. Actually, the ideas might be
pretty good. I do not mean to be rude :-)

I actually have a new idea too:

Textarchive: /allpages.zip

or

Textarchive: /publicdocs.tar.gz

(or any other compressed archive format) instructs robots to fetch
everything there is in one compressed file. Is this a simple enough
interface for everyone to accept? Too simple?
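
On the robot side this would be easy to support. A minimal sketch,
assuming the tar.gz form above; the field, the URL and the function
name are of course just this proposal, nothing standard:

import io
import tarfile
import urllib.request

def fetch_text_archive(base_url, archive_path="/publicdocs.tar.gz"):
    # one GET instead of crawling every page on the site
    with urllib.request.urlopen(base_url + archive_path) as resp:
        data = resp.read()
    pages = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                pages[member.name] = tar.extractfile(member).read()
    return pages   # member name -> document body, ready for indexing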