I think the bottom line still is how to determine if
something is a robot. On one hand, to say that anything
that is automatic is a robot is too extreme: as the
example somebody made illustrates, a user-agent that
accesses a page once a day on the direct behalf of a
single user hardly constitutes a robot, and in fact is
doing little more than the user himself would have done.
On the other hand, to say that something is a robot only
if it is following HTML links and sucking up whole pages
is too narrow.
To define it based on the volume of access is too
difficult. To define it based on, say, whether it is
acting on behalf of a single person (a user agent)
or multiple people (creating an index) does not get
at the real issue: overload of the target machine.
My inclination is to suggest that we create a kind of
robot Turing test. We can call it the xxx.lanl test
in honor of our good friends over there. Basically,
the test says that, if the site cannot tell that it is
a robot, then it is not a robot, and does not have to
follow the robots.txt exclusion. The reverse also
holds: if the site thinks it is a robot, then it is
a robot, and must follow the exclusion.
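
For concreteness, here is a minimal sketch of what "following
the robots.txt exclusion" amounts to in practice, written in
Python with the standard urllib.robotparser module; the bot
name and the exact URLs are hypothetical, chosen only for
illustration:

    import urllib.robotparser

    # Hypothetical crawler identity and target site (illustrative only).
    USER_AGENT = "ExampleBot/1.0"
    SITE = "http://xxx.lanl.gov"

    # Fetch and parse the site's robots.txt before requesting anything else.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()

    url = SITE + "/some/page"
    if rp.can_fetch(USER_AGENT, url):
        print("allowed:", url)      # the exclusion rules permit this fetch
    else:
        print("disallowed:", url)   # a well-behaved robot stops here

The sketch only shows the mechanics of the exclusion; whether
a given process needs to honor it at all is what the test
above is meant to decide.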
I also recommend that, before launching any automatic
process that intends not to follow the robot exclusion,
one first try it on xxx.lanl to see whether it does in
fact pass the xxx.lanl test.
PF