Re: Suggestion to help robots and sites coexist a little better

Jaakko Hyvatti (Jaakko.Hyvatti@Elma.FI)
Thu, 18 Jul 1996 01:52:57 +0300 (EET DST)


Scott:
> If separate instructions files were used, there could be additional
> instructions added to let the robot know whether it should look for
> robots.txt files in deeper directories. It might make the robot
> writers job harder, but would allow users more control while still
> allowing the administrator of the site to prevent robots from
> endlessly looking for robots.txt files in directories where they do

What's wrong with collecting robots.txt files from user directories
and merging them into a single /robots.txt? This gives the users as
much control as the webmaster wants them to have, introduces no new
standards, is easier to implement, and adds less overhead.

(I recently posted a sample implementation, a Perl script, here.
It's at <URL:http://www.elma.fi/~jaakko/makerobots.perl>.)
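
For readers who want to see the shape of the merge without fetching
the script, here is a minimal sketch in Python. It is not the Perl
script mentioned above. It assumes user pages live under /~user, that
users may only contribute Disallow lines, and that the webmaster's own
file holds a single "User-agent: *" record; the function name
merge_user_robots is made up for this example.

```python
def merge_user_robots(site_rules, user_rules):
    """Merge per-user Disallow lines into the site-wide record.

    site_rules: text of the webmaster's own rules (one record assumed).
    user_rules: mapping of user name -> text of that user's robots.txt.
    """
    lines = [line.rstrip() for line in site_rules.strip().splitlines()]
    for user in sorted(user_rules):
        prefix = "/~" + user  # assumption: user pages are served under /~user
        for raw in user_rules[user].splitlines():
            line = raw.split("#", 1)[0].strip()  # drop trailing comments
            field, _, value = line.partition(":")
            if field.strip().lower() != "disallow":
                continue  # users are only allowed to add Disallow lines
            path = value.strip()
            if not path.startswith("/"):
                continue  # skip empty or relative paths
            lines.append("Disallow: " + prefix + path)
    return "\n".join(lines) + "\n"
```

Because every user path is rewritten under that user's own prefix, a
user cannot restrict or open up anything outside his own directory.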

* * *

While you are listening, I would like to hear comments about a very
simple protocol that lets webmasters hand indexing robots exactly the
pages they want indexed, and efficiently too. A new robots.txt
keyword 'Archive-file:' is introduced:

User-agent: *
Disallow: /cgi-bin
Disallow: /local
Disallow: /tmp
Archive-file: /tmp/files1.zip
Archive-file: /tmp/files2.zip
Archive-file: /tmp/files3.tar.gz

The referenced files are simply archives of all the .html and .txt
files on the server. They may be split into several archives so that
the robot can request them one at a time with an If-Modified-Since:
header, which saves bandwidth by not repeatedly copying archives of
static pages over the net.
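
The conditional request could look roughly like the sketch below,
written in Python with only the standard library. The function names,
the opener parameter (there so the network call can be swapped out),
and the idea of remembering the last fetch as a Unix timestamp are all
assumptions of this example, not part of the proposal.

```python
import email.utils
import urllib.error
import urllib.request


def build_conditional_request(url, last_fetch):
    """Build a GET request for `url` that is conditional on `last_fetch`.

    last_fetch is the Unix timestamp of the previous successful fetch.
    """
    stamp = email.utils.formatdate(last_fetch, usegmt=True)  # HTTP-date in GMT
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url, last_fetch, opener=urllib.request.urlopen):
    """Return the archive bytes, or None when the server answers 304."""
    request = build_conditional_request(url, last_fetch)
    try:
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the copy we already have
            return None
        raise
```

A robot would call fetch_if_modified once per archive and skip any
archive for which it gets None back.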

A robot should support at least the .zip and .tar.gz formats.
The server should identify the archive type with the Content-type
response header.
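
As a sketch of the robot side, the Python below picks an unpacker from
the Content-type value and keeps only the .html and .txt members. The
proposal does not say which media types a server would send, so the
table ARCHIVE_KINDS is a guess, and extract_pages is a name invented
for this example.

```python
import io
import tarfile
import zipfile

# Assumed media types; the proposal does not fix these values.
ARCHIVE_KINDS = {
    "application/zip": "zip",
    "application/x-gzip": "tar.gz",
    "application/gzip": "tar.gz",
}
WANTED = (".html", ".txt")


def extract_pages(content_type, data):
    """Return {member name: bytes} for the .html and .txt files in an archive."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    kind = ARCHIVE_KINDS.get(media_type)
    pages = {}
    if kind == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for name in archive.namelist():
                if name.lower().endswith(WANTED):
                    pages[name] = archive.read(name)
    elif kind == "tar.gz":
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
            for member in archive:
                if member.isfile() and member.name.lower().endswith(WANTED):
                    pages[member.name] = archive.extractfile(member).read()
    else:
        raise ValueError("unsupported archive type: " + content_type)
    return pages
```

Reading members in memory, rather than extracting to disk, keeps the
robot from writing files under names chosen by the remote server.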

If an archive-aware robot finds an Archive-file: line, it does not
traverse the server but just reads all the archives (those that have
been modified).
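
The decision itself is small. A possible sketch in Python follows; the
function names are invented here, and the parser only understands as
much of robots.txt as this example needs (records separated by blank
lines, '#' comments, and the proposed Archive-file: keyword).

```python
def parse_archive_files(robots_txt, agent="*"):
    """Return the Archive-file paths in records whose User-agent is `agent`."""
    archives = []
    applies = False
    for raw in robots_txt.splitlines():
        if not raw.strip():
            applies = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line, not a record boundary
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if value == agent:
                applies = True
        elif field == "archive-file" and applies and value:
            archives.append(value)
    return archives


def plan_crawl(robots_txt):
    """Choose between reading archives and traversing the server."""
    archives = parse_archive_files(robots_txt)
    if archives:
        return ("archives", archives)
    return ("traverse", [])
```

A robot that does not know the keyword would ignore the Archive-file:
lines and traverse the server as before.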

There are a number of things that make me suspect that this is not
a good idea. I just wanted to share it.