Re: Suggestion to help robots and sites coexist a little better

Rob Hartill (robh@imdb.com)
Mon, 15 Jul 1996 15:06:40 -0500 (CDT)


'Benjamin Franz' wrote
>
>On Sun, 14 Jul 1996, Rob Hartill wrote:
>
>>
>> (sent to robots@webcrawler.com and the apache developers list)
>>
>> Here's my suggestion:
>>
>> 1) robots send a new HTTP header to say "I'm a robot", e.g.
>> Robot: Indexers R us
>>
>> 2) servers are extended (e.g. with an Apache module) to look for this
>> header and, based on local configuration (*), issue "403 Forbidden"
>> responses to robots that stray out of their allowed URL-space
>>
>> 3) (*) on the site being visited, the server would read robots.txt and
>> perhaps other configuration files (such as .htaccess) to determine
>> which URLs/directories are off-limits.
>>
>> Using this system, a robot that correctly identifies itself as such will
>> not be able to accidentally stray into forbidden regions of the server
>> (well, it won't have much luck if it does, and won't cause damage).
>>
>> Adding an Apache module to the distribution would make more web admins
>> aware of robots.txt and the issues relating to it. Being the leader, Apache
>> can implement this and the rest of the pack will follow.
>
>
>It doesn't add any functionality and does add overhead.

The overhead is peanuts compared to the network traffic that'll be saved
when robots are kept out of forbidden areas: add a few bytes, save a few
megabytes.
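
To make the idea concrete, here's a rough sketch in Python of the check
such a server module would perform. Only the "Robot" header name comes
from the suggestion above; the function names are made up, and a real
implementation would also honour the User-agent sections of robots.txt,
which this ignores:

    # Sketch only: collect the Disallow prefixes from robots.txt and
    # refuse self-identified robots that stray into them.

    def disallowed_prefixes(path):
        """Return the Disallow path prefixes listed in a robots.txt file."""
        prefixes = []
        for line in open(path):
            line = line.strip()
            if line.lower().startswith("disallow:"):
                prefixes.append(line.split(":", 1)[1].strip())
        return prefixes

    def response_status(headers, url, prefixes):
        """403 for a self-identified robot that's off-limits, else 200."""
        if "Robot" not in headers:
            return 200                  # not a robot: serve as normal
        for prefix in prefixes:
            if prefix and url.startswith(prefix):
                return 403              # Forbidden
        return 200

A robot that doesn't send the header is served as before; one that does
gets a cheap 403 instead of the page, which is where the saving comes from.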

>Poorly
>written/run robots that do not follow REP now are not going to issue a
>'Robot:' HTTP header, either.

True, but it makes it easier to improve a rogue robot's behaviour without
relying on someone intelligent operating it.
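
The fix on the robot side is one extra header line per request. A sketch,
again assuming only the header name from the proposal (the URL is just a
placeholder):

    import urllib.request

    # One extra header identifies the crawler; a server module that
    # enforces robots.txt can then do the rest automatically.
    req = urllib.request.Request("http://www.example.com/",
                                 headers={"Robot": "Indexers R us"})
    page = urllib.request.urlopen(req).read()  # a 403 raises HTTPError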

>Additionally - it would increase abuse
>attempts by servers to 'lie to the robots just to improve my search
>position' by allowing servers to more easily serve something *different*
>to robots than it actually serves to people.

So it's a bad idea to add a system that limits abuse of servers, just in
case the robot-driven indexers get abused? That's not a convincing argument
for the suffering people whose existence justifies robots in the first place.

>It happens now - but not as
>much because the protocol doesn't provide a *direct* way to identify all
>robots (they can, of course, key on the User-Agent or IP, but it requires
>more work on their part).

So you're happy to keep making it difficult (impractical) to detect a
robot from the server side, so that your life as a robot owner is made
easier. That's just selfish.

The robot owners can trade lists of sites that abuse their service
now and in the future. Those sites can be avoided as punishment.

>Lastly, the place to bring up HTTP protocol changes is not the robots list
>or the Apache list, but the IETF HTTP-WG. It might have interactions with
>the rest of HTTP that require working out.

As an Apache developer, I can suggest whatever I want to the others. As
a member of this list I think I can do the same here. Roy Fielding and others
from the HTTP camp are on one or both of these lists anyway.

It's probably pointless to suggest it to the HTTP-WG if people like you
are going to be selfish and ignore it.

rob

-- 
Rob Hartill (robh@imdb.com)
The Internet Movie Database (IMDb)  http://www.imdb.com/
           ...more movie info than you can poke a stick at.