>
> (sent to robots@webcrawler.com and the apache developers list)
>
> Here's my suggestion:
>
> 1) robots send a new HTTP header to say "I'm a robot", e.g.
> Robot: Indexers R us
>
> 2) servers are extended (e.g. an Apache module) to look for this
> header and, based on local configuration (*), issue "403 Forbidden"
> responses to robots that stray out of their allowed URL-space
>
> 3) (*) on the site being visited, the server would read robots.txt and
> perhaps other configuration files (such as .htaccess) to determine
> which URLs/directories are off-limits.
>
> Using this system, a robot that correctly identifies itself as such will
> not be able to accidentally stray into forbidden regions of the server
> (well, it won't have much luck if it does, and it won't cause damage).
>
> Adding an Apache module to the distribution would make more web admins
> aware of robots.txt and the issues relating to it. Being the leader, Apache
> can implement this and the rest of the pack will follow.
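
For concreteness, the proposed check boils down to something like the
sketch below. This is only an illustration, not an Apache module: the
'Robot:' header and the robots.txt rules come straight from the proposal
above, while the tiny standalone Python server, the sample Disallow paths
and the port number are all invented here.

# Minimal sketch of the proposed behaviour: a self-identified robot that
# requests a URL disallowed by the site's robots.txt gets "403 Forbidden".
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.robotparser import RobotFileParser

# Local policy, exactly as it would appear in the site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

class RobotAwareHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        robot_id = self.headers.get("Robot")      # the proposed new header
        if robot_id is not None and not rules.can_fetch(robot_id, self.path):
            # A self-identified robot strayed outside its allowed URL-space.
            self.send_error(403, "Forbidden to robots by robots.txt")
            return
        # Normal handling for people, and for robots on allowed paths.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), RobotAwareHandler).serve_forever()
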
It doesn't add any functionality and it does add overhead. Poorly
written/run robots that do not follow the REP now are not going to issue a
'Robot:' HTTP header, either. Additionally, it would increase abuse
attempts by servers to 'lie to the robots just to improve my search
position' by making it easier for them to serve something *different*
to robots than they actually serve to people. It happens now - but not as
much, because the protocol doesn't provide a *direct* way to identify all
robots (they can, of course, key on the User-Agent or IP, but it requires
more work on their part).
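
As an illustration of the kind of extra work involved today, a rough
sketch (the language is arbitrary and the list of User-Agent substrings
is made up and far from complete):

# Sketch: keying on the User-Agent to spot well-known robots, which is
# roughly what a server has to do now to treat robots differently.
KNOWN_ROBOT_SUBSTRINGS = ("crawler", "spider", "robot")

def looks_like_robot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(s in ua for s in KNOWN_ROBOT_SUBSTRINGS)

print(looks_like_robot("WebCrawler/3.0"))                # True
print(looks_like_robot("Mozilla/3.0 (X11; I; Linux)"))   # False
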
Lastly, the place to bring up HTTP protocol changes is not the robots list
or the Apache list, but the IETF HTTP-WG. A new header might have
interactions with the rest of HTTP that need working out.
-- Benjamin Franz