Re: nastygram from xxx.lanl.gov

Michael De La Rue (mikedlr@indy.unipress.waw.pl)
Wed, 10 Jul 1996 13:49:03 +0200


In message <9607100135.AA20669@cactus.slab.ntt.jp>, Paul Francis writes:
>I think the bottom line still is how to determine if
>something is a robot. On one hand, to say that anything

It's automatic.

>that is automatic is a robot is too extreme, given
>the example somebody made of a user-agent that once-a-day
>accesses a page on the direct behalf of a single user
>hardly constitutes a robot, and in fact is doing little
>more than the user himself would have done.

Yes, but the specific point about a browser following a link is that
it ALWAYS has user interaction and monitoring. If it turns out that
the browser is downloading 5000MB of garbage (this could be a server
error in the HEAD request, so don't claim it's not applicable in this
case), then the user will stop it somewhere between 50k and 1MB. The
robot won't. As long as there is an agent with net access and no
human, there is potential for disaster.

The test should be:

    do you select each link to be traversed before it is traversed?

    do you examine the output of that link as it downloads, or at
    least monitor the size / time it's taking?

If not, it's a robot (a rough sketch of that kind of monitoring
follows).
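
Here is a minimal sketch of that second criterion, in modern Python
and assuming a plain HTTP GET; the 1MB and 60-second cut-offs are the
illustrative numbers from above, not part of any standard:

    import time
    import urllib.request

    MAX_BYTES = 1_000_000   # a watching user kills it between 50k and 1MB
    MAX_SECONDS = 60        # illustrative patience limit

    def monitored_fetch(url):
        # Fetch url, aborting if the response grows too large or takes
        # too long.  A human at a browser provides this monitoring for
        # free; an unattended robot has none unless it is coded in.
        start = time.time()
        received = 0
        chunks = []
        with urllib.request.urlopen(url, timeout=MAX_SECONDS) as resp:
            while True:
                chunk = resp.read(8192)
                if not chunk:
                    break
                received += len(chunk)
                if received > MAX_BYTES:
                    raise RuntimeError("aborted: over %d bytes" % MAX_BYTES)
                if time.time() - start > MAX_SECONDS:
                    raise RuntimeError("aborted: over %d seconds" % MAX_SECONDS)
                chunks.append(chunk)
        return b"".join(chunks)

An agent that imposes limits like these before blindly traversing the
next link at least fails safely; one that doesn't is exactly the
unmonitored robot described above.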

>[well I've disagreed at an earlier level, so snip]
>
>My inclination is to suggest that we create a kind of
>robot Turing test. We can call it the xxx.lanl test
>in honor of our good friends over there. Basically,
>the test says that, if the site cannot tell that it is
>a robot, then it is not a robot, and does not have to
>follow the robots.txt exclusion. The reverse also
>holds---if the site thinks it is a robot, then it is
>a robot, and must follow the exclusion.

But there are many things which could do robot-style damage
(automatic uncontrolled downloads) to my server without me being able
to tell. For example, things which check for updates on pages can do
this; you just need enough of them. A thousand agents each re-checking
a handful of pages every hour add up to robot-level load, even though
each one looks harmless on its own. If I say no robots, that's what I
mean.

If I need an exception to the robots exclusion protocol, what I'll do
is go and ask the site admin. Is this so difficult? Generally it's
the first email address on the top page of the whole site.

>I also recommend that, before one launches any automatic
>process that intends not to follow the robot exclusion,
>they first try it on xxx.lanl to see if it in fact
>passes the xxx.lanl test.

and then we get xxx.lanl going more and more hair-trigger (it's not
Netscape, so it must be a robot) as they try to catch these crazy
robots people, who are now generating more traffic there with robot
testing than they served in the first place :-)

I do think we should get a better robot exclusion protocol which
distinguishes (a sketch of such a file follows the list):

    head requests / body requests

    time of day (so robots visit during the server's free, off-peak
    hours)

    maximum rate of requests (after this many k, wait this long?)

    how often we update our robots.txt

    reason for robot (link checking / indexing / finding junk
    email lists)
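
As a purely hypothetical sketch, a robots.txt extended with those
fields might look like this. None of these directives are in the real
exclusion standard; the names (Request-Types, Visit-Hours, Rate-Limit,
Recheck-After, Allowed-Purpose) are invented here to illustrate the
idea:

    # hypothetical extensions -- not part of the actual standard
    User-agent: *
    Disallow: /private/
    Request-Types: head                 # bodies only by arrangement
    Visit-Hours: 0200-0600              # our free time, server-local
    Rate-Limit: 100k/600s               # after this many k, wait this long
    Recheck-After: 7d                   # how often we update this file
    Allowed-Purpose: link-checking indexing   # no junk-email harvesting

Conveniently, the original exclusion standard already tells parsers to
ignore headers they don't recognise, so fields like these could be
added without breaking existing robots.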

<http://www.tardis.ed.ac.uk/~mikedlr/biography.html>
Scottish Climbing Archive: <http://www.tardis.ed.ac.uk/~mikedlr/climbing/>
Linux/Unix clone@ftp://src.doc.ic.ac.uk/packages/linux/sunsite.unc-mirror/docs/