Here's my suggestion:
1) robots send a new HTTP header to say "I'm a robot", e.g.
Robot: Indexers R us
2) servers are extended (e.g. an Apache module) to look for this
header and, based on local configuration (*), issue "403 Forbidden"
responses to robots that stray out of their allowed URL-space
3) (*) on the site being visited, the server would read robots.txt and
perhaps other configuration files (such as .htaccess) to determine
which URLs/directories are off-limits (rough sketches of both sides
follow this list).
Using this system, a robot that correctly identifies itself as such will
not be able to accidentally stray into forbidden regions of the server
(or at least it won't have much luck if it does, and won't cause damage).
Adding such an Apache module to the distribution would make more web admins
aware of robots.txt and the issues relating to it. Being the leader, Apache
can implement this and the rest of the pack will follow.
Followups to robots@webcrawler.com
rob
--
Rob Hartill (robh@imdb.com)
The Internet Movie Database (IMDb)   http://www.imdb.com/
        ...more movie info than you can poke a stick at.