Re: agents ignoring robots.txt

Captain Napalm (spc@armigeron.com)
Wed, 16 Oct 1996 17:51:11 -0400 (EDT)


From some obscure corner of the Matrix, Rob Hartill was seen transmitting:
>
> I've started logging user agents that attempt to access areas of my server
> that robots.txt are supposed to keep them away from.
>
I've kept quiet about this, but I'm curious what the reaction would be to
something along the lines of the Cyber411 Meta Search Engine
(http://www.cyber411.com/search/) - besides the obvious ethical, moral
and legal ramifications, that is. (I was hired to write the software end of
things; the company that hired me - us, my company actually - was made aware
of the possible ethical, moral and legal ramifications and went ahead
anyway.)

The last time I checked, several months ago, only 6 or 7 of the 15 engines
we use had a robots.txt file, and even fewer disallowed access to /cgi-bin
(something like 3 if I remember correctly). Cyber411 currently ignores
robots.txt, since (if I may rationalize things here) it only grabs one page
(the results page), and then only under the direction of a human using
Cyber411 (at that time). In this regard, it is a human-controlled agent
acting on behalf of a human, who is sitting there looking at the results as
they come in [1].
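
(For the curious, here is roughly what checking an engine's robots.txt
before grabbing a results page would look like - just a Python sketch on my
part; the engine URL, path and query below are placeholders, not anything
we actually hit:)

    # Sketch only: consult an engine's robots.txt before fetching results.
    # "www.example-engine.com" and the /cgi-bin path are made-up examples.
    import urllib.robotparser

    AGENT = "Cyber411"

    def may_fetch(results_url, robots_url):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)        # robots.txt lives at the server root
        rp.read()
        return rp.can_fetch(AGENT, results_url)

    url = "http://www.example-engine.com/cgi-bin/search?q=robots"
    if may_fetch(url, "http://www.example-engine.com/robots.txt"):
        print("allowed - grab the results page")
    else:
        print("disallowed by robots.txt - skip this engine")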

I ask this because it isn't that inconceivable to see a plug-in (or a
separate program) being written that does what Cyber411 does (maybe without
the ads). At what point does an agent NEED to follow the robots.txt
convention? I believe current versions of Lynx allow the following:

lynx -traverse http://www.cyber411.com/

(I think I have the correct option), which can be just as bad as a rogue
robot.

> Apart from that, every other robot/crawler/.. has behaved.
>
For what it's worth, Cyber 411 sends the following:

Agent: Cyber411/version OS/version
From: www.cyber411.com

(version is currently 0.9.10C, and it will be run from an IRIX/5.3.1,
Linux/1.2.13 or Linux/2.0.0 system)
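
(If anyone wants to reproduce those headers while testing against their own
logs, a rough sketch follows - the engine URL is made up; only the header
names and values come from what I listed above:)

    # Sketch: attach the Agent and From headers described above to a request.
    # "www.example-engine.com" is a placeholder target, not a real engine.
    import urllib.request

    req = urllib.request.Request(
        "http://www.example-engine.com/cgi-bin/search?q=robots")
    req.add_header("Agent", "Cyber411/0.9.10C Linux/2.0.0")
    req.add_header("From", "www.cyber411.com")

    with urllib.request.urlopen(req) as resp:
        page = resp.read()   # the single results page Cyber411 would grab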

Oh, and while I'm here, is there any way the Powers That Be who run this
list can have a Reply-To: header added? If I'm not careful, I'll end up
sending mail to an individual when it was intended for the list (and it's
happened a few times).

-spc (Working on this has piqued my interest in robots though ... )

[1] I am unaware of anyone using Cyber411 to conduct searches
	autonomously, or in the same way that we ourselves use the various
	engines. I personally would be amused at such a thought, although
	the company that hired us to do this would probably see things
	differently 8-)