The last time I checked, several months ago, only 6 or 7 of the 15 engines
we use had a robots.txt file, and even fewer (something like 3, if I
remember correctly) disallowed access to /cgi-bin. Cyber411 currently
ignores robots.txt, since (if I may rationalize things here) it only grabs
one page (the results page), and then only under the direction of a human
using Cyber411 at that time. In this regard, it is a human-controlled agent
acting on behalf of a human, who is sitting there looking at the results as
they come in [1].
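
If anyone cares to repeat that survey, a quick Python sketch along these
lines would do it (the host names below are placeholders, not our actual
engine list):

  import urllib.error
  import urllib.request
  from urllib.robotparser import RobotFileParser

  # Placeholder hosts; substitute the engines you actually query.
  ENGINES = ["www.example-engine-a.com", "www.example-engine-b.com"]

  for host in ENGINES:
      robots_url = "http://%s/robots.txt" % host
      try:
          with urllib.request.urlopen(robots_url, timeout=10) as resp:
              lines = resp.read().decode("latin-1").splitlines()
      except (urllib.error.URLError, OSError):
          print("%-30s  no robots.txt (or unreachable)" % host)
          continue
      rp = RobotFileParser()
      rp.parse(lines)
      blocked = not rp.can_fetch("Cyber411", "http://%s/cgi-bin/" % host)
      print("%-30s  robots.txt present, /cgi-bin %s"
            % (host, "disallowed" if blocked else "allowed"))
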
I ask because it isn't that inconceivable that someone could write a
plug-in (or a separate program) that does what Cyber411 does (maybe without
the ads). At what point does an agent NEED to follow the /robots.txt
convention? Especially since I think current versions of Lynx allow the
following:
lynx -traversal http://www.cyber411.com/
(I think that's the correct option), which can be just as bad as a rogue
robot.
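
If such a plug-in did want to follow the convention, the check is cheap;
here is a minimal sketch (the engine URLs are placeholders):

  from urllib.robotparser import RobotFileParser

  def allowed(agent, page_url, robots_url):
      """True if robots.txt permits `agent` to fetch `page_url`.
      A missing or unreadable robots.txt is treated as permission here."""
      rp = RobotFileParser()
      rp.set_url(robots_url)
      try:
          rp.read()
      except OSError:
          return True
      return rp.can_fetch(agent, page_url)

  # One check before grabbing the single results page.
  if allowed("Cyber411",
             "http://www.example-engine.com/cgi-bin/search?q=robots",
             "http://www.example-engine.com/robots.txt"):
      pass  # go ahead and fetch the results page
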
> Apart from that, every other robot/crawler/.. has behaved.
>
For what it's worth, Cyber411 sends the following:
Agent: Cyber411/version OS/version
From: www.cyber411.com
(version is currently 0.9.10C, and it will be run from either an
IRIX/5.3.1, Linux/1.2.13, or Linux/2.0.0 system)
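
If someone wanted to send the same sort of identification from a script, it
might look roughly like this (a Python sketch; the URL is a placeholder,
and the first field is spelled the standard way, as User-Agent):

  import urllib.request

  req = urllib.request.Request(
      "http://www.example-engine.com/cgi-bin/search?q=robots",
      headers={
          # Agent name/version plus OS/version, as described above.
          "User-Agent": "Cyber411/0.9.10C Linux/2.0.0",
          "From": "www.cyber411.com",
      },
  )
  with urllib.request.urlopen(req) as resp:
      results_page = resp.read()
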
Oh, and while I'm here, is there any way the Powers That Be who run this
list can have a Reply-To: header added? If I'm not careful, I'll end up
sending mail to an individual when it was intended for the list (and that
has happened a few times).
-spc (Working on this has piqued my interest in robots though ... )
[1] I am unaware of anyone using Cyber411 to conduct searches
    autonomously, or of anyone else querying the various engines the way
    we do. I personally would be amused at such a thought, although the
    company that hired us to do this would probably see things
    differently 8-)