What's wrong with collecting robots.txt files from user directories
and merging them into one single /robots.txt? This gives users as
much control as the webmaster wants them to have without introducing
any new standard; it is also easier to implement and adds less
overhead.
(I posted a sample implementation, a perl script, here recently.
It's at <URL:http://www.elma.fi/~jaakko/makerobots.perl>.)
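(For illustration, here is a rough Python sketch of the same merging
idea. It is not the perl script above, and the /home/*/public_html
layout and the /~user/ URL mapping are just assumptions about how a
site might be set up.)

# Merge per-user robots.txt files into one site-wide /robots.txt.
# Assumed layout: /home/<user>/public_html/robots.txt maps to /~<user>/.
import glob, os

merged = []
for path in glob.glob("/home/*/public_html/robots.txt"):
    user = path.split(os.sep)[2]              # /home/jaakko/... -> jaakko
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Rewrite each user's Disallow paths relative to /~user/
            if line.lower().startswith("disallow:"):
                rule = line.split(":", 1)[1].strip()
                merged.append("Disallow: /~%s%s" % (user, rule))

with open("/var/www/robots.txt", "w") as out:      # assumed document root
    out.write("User-agent: *\n")
    out.write("\n".join(merged) + "\n")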
* * *
While you are listening, I would like to hear comments on a very
simple protocol that lets webmasters hand indexing robots exactly
the pages they want, and efficiently too. A new robots.txt keyword,
'Archive-file:', is introduced (a parsing sketch follows the example):
User-agent: *
Disallow: /cgi-bin
Disallow: /local
Disallow: /tmp
Archive-file: /tmp/files1.zip
Archive-file: /tmp/files2.zip
Archive-file: /tmp/files3.tar.gz
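(A minimal sketch of how a robot might read the proposed keyword;
the flat parsing, ignoring per-User-agent grouping, and the function
name are my assumptions.)

# Pull Disallow: and the proposed Archive-file: lines out of a
# robots.txt body.  Real robots would group these per User-agent.
def parse_robots_txt(text):
    disallow, archives = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or ":" not in line:
            continue
        field, value = [s.strip() for s in line.split(":", 1)]
        field = field.lower()
        if field == "disallow":
            disallow.append(value)
        elif field == "archive-file":
            archives.append(value)
    return disallow, archives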
The referenced files are simply archives of all the .html and .txt
files on the server. They may be split into several archives
because that lets the robot request them one at a time with an
If-Modified-Since: header, saving bandwidth by not repeatedly
copying archives of static pages over the net.
A robot should support at least the .zip and .tar.gz formats.
The server should identify the archive type with the Content-Type
response header.
If an archive-aware robot finds an Archive-file: header, it does
not traverse the server but just reads all the archives (if they
have been modified).
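(Again only a sketch of how the fetch side might look; the 304
handling, the exact Content-Type values and the helper name are my
assumptions, not part of the proposal.)

# One conditional GET per archive, unpacked according to Content-Type.
import io, tarfile, zipfile, urllib.request, urllib.error

def fetch_archive(url, last_fetch):          # last_fetch: HTTP-date string or None
    req = urllib.request.Request(url)
    if last_fetch:
        req.add_header("If-Modified-Since", last_fetch)
    try:
        with urllib.request.urlopen(req) as resp:
            ctype = resp.headers.get("Content-Type", "")
            data = io.BytesIO(resp.read())
    except urllib.error.HTTPError as e:
        if e.code == 304:                    # unchanged since last visit
            return []
        raise
    if "zip" in ctype:
        with zipfile.ZipFile(data) as z:
            return [(n, z.read(n)) for n in z.namelist()]
    else:                                    # otherwise assume gzipped tar
        with tarfile.open(fileobj=data, mode="r:gz") as t:
            return [(m.name, t.extractfile(m).read())
                    for m in t.getmembers() if m.isfile()]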
There are a number of things that make me suspect this is not
a good idea. I just wanted to share it.