Re: defining "robot"

Matthew K Gray (mkgray@MIT.EDU)
22 Nov 1996 00:39:41 -0500


I would propose the following as a definition for a robot, in terms of
what should and should not follow robots.txt.

--------------------
A "robot" is any piece of software which automatically retrieves web
pages/documents* and does not immediately present them via some
mechanism for human consumption before proceeding further.

* I define here "web pages/documents" as a file and any content it
inlines (this includes images, java applets, or other embedded
content, but not other documents it points to).
--------------------

By this definition, Netscape Navigator is not a robot, because it
retrieves only documents which will be immediately presented to a
person.

By this definition, a proxy server is not a robot, because it
retrieves only documents which will be immediately returned to a
client, which generally will be presenting them to a person. In the
case of a robot going through a proxy, the robot itself should clearly
request robots.txt. In any case, the proxy does not 'proceed further'
without direction. (On the other hand, a proxy which periodically
mirrors popular sites would be a robot.)

By this definition, Navigator's "Check for changes in my bookmarks"
feature is a robot, and I am of the opinion that it should check for
/robots.txt.
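
As a rough sketch of what such a check could look like, the Python
below filters a bookmark list through robots.txt rules using the
standard library's urllib.robotparser. The user-agent name
"bookmark-checker", the example.com URLs, and the sample rules are all
made up for illustration; a real client would download /robots.txt from
each bookmarked host rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real checker would fetch this
# from the bookmarked host before touching any of its pages.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_bookmarks(bookmarks, robots_txt, user_agent="bookmark-checker"):
    """Return the bookmarks that the given robots.txt rules permit us to check."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in bookmarks if parser.can_fetch(user_agent, url)]

bookmarks = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/notes.html",
]
print(allowed_bookmarks(bookmarks, SAMPLE_ROBOTS_TXT))
```

With these sample rules, only the bookmark outside /private/ would be
checked for changes.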

By this definition, offline web readers are certainly robots.
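
To make the test concrete, here is a minimal Python sketch of the
definition as a predicate, applied to the cases above. The Client
class and its two flags are my own shorthand, not part of the proposal,
and the flag values for each program are one reading of the examples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Client:
    name: str
    # Assumption: True if the software fetches documents on its own,
    # without a person asking for each one.
    retrieves_automatically: bool
    # Assumption: True if each document is handed to a person (or to the
    # client that asked for it) before the software proceeds further.
    presents_immediately: bool

def is_robot(client: Client) -> bool:
    """The proposed test: automatic retrieval without immediate presentation."""
    return client.retrieves_automatically and not client.presents_immediately

CASES = [
    Client("Netscape Navigator", True, True),
    Client("proxy server", True, True),
    Client("mirroring proxy", True, False),
    Client("bookmark change check", True, False),
    Client("offline web reader", True, False),
]

for case in CASES:
    print(f"{case.name}: {'robot' if is_robot(case) else 'not a robot'}")
```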

I think, because this definition of a robot is inclusive, there needs
to be some sort of mechanism for recording robot intent. That is,
I potentially would like to prevent part of my site from being hit
by a robot that is an "indexer" but not mind if it is hit by an "agent".

Of course, defining these categories of robots is not easy, but there are
certainly some obvious categories:

indexer      Indexes web pages
mirror       Makes local copies for presentation via proxy or offline
             web viewer
agent-user   Processes relatively few web pages on behalf of a fairly
             directed user request (e.g., Netscape's bookmark check)
agent-auto   Like an agent-user, but usually not limited to a few pages
             (a research robot trying to study web topology probably
             fits here)

There are probably others, and perhaps 'agent-user' and 'agent-auto' need to be
better defined and split into more categories.

Something like an "Agent-Type:" line could do this, but it requires
specifying in more detail what Agent-Types exist. (Of course, we
could define a small number and an 'other' category, then define more
types as they become apparent and the need arises.)
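
For illustration only, here is a small Python sketch of how such a line
might be read. "Agent-Type:" is not part of any robots.txt
specification, and the syntax assumed here (one or more "Agent-Type:"
lines opening a record, followed by "Disallow:" lines) is just one
guess at how it could work. The function names and sample rules are
likewise hypothetical.

```python
# The four categories listed earlier, plus the suggested 'other' catch-all.
KNOWN_TYPES = {"indexer", "mirror", "agent-user", "agent-auto", "other"}

def parse_agent_type_rules(text):
    """Map each agent type to the path prefixes it is asked to avoid.

    ASSUMED syntax, not a standard: a record is one or more
    'Agent-Type:' lines followed by 'Disallow:' lines.
    """
    rules = {}
    current = []       # agent types the current record applies to
    in_rules = False   # True once the current record has seen a Disallow line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "agent-type":
            if in_rules:                      # a new record starts here
                current, in_rules = [], False
            agent_type = value.lower()
            current.append(agent_type)
            rules.setdefault(agent_type, [])
        elif field == "disallow" and current:
            in_rules = True
            if value:                         # an empty Disallow forbids nothing
                for agent_type in current:
                    rules[agent_type].append(value)
    return rules

def may_fetch(rules, agent_type, path):
    """True unless a rule for this robot's declared type covers the path."""
    if agent_type not in KNOWN_TYPES:
        agent_type = "other"                  # unrecognized types fall back to 'other'
    return not any(path.startswith(prefix) for prefix in rules.get(agent_type, []))

SAMPLE = """\
# Keep indexers out of the archive, but let agents in.
Agent-Type: indexer
Disallow: /archive/

Agent-Type: agent-user
Agent-Type: agent-auto
Disallow:
"""

rules = parse_agent_type_rules(SAMPLE)
print(may_fetch(rules, "indexer", "/archive/1996/msg01.html"))
print(may_fetch(rules, "agent-user", "/archive/1996/msg01.html"))
```

Under these sample rules the indexer is turned away from /archive/
while the agent-user is let through, which is the distinction described
above.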

...Matthew