--------------------
A "robot" is any piece of software which automatically retrieves web
pages/documents* and does not immediately present them via some
mechanism for human consumption before proceeding further.
* I define here "web pages/documents" as a file and any content it
inlines (this includes images, Java applets, and other embedded
content, but not other documents it points to).
--------------------
By this definition, Netscape Navigator is not a robot, because it
retrieves only documents which will be immediately presented to a
person.
By this definition, a proxy server is not a robot, because it
retrieves only documents which will be immediately returned to a
client, which generally will be presenting them to a person. In the
case of a robot going through a proxy, the robot should clearly
request robots.txt. In any case, the proxy does not 'proceed further'
without direction. (On the other hand, a proxy which periodically
mirrors popular sites would be a robot.)
By this definition, Navigator's "Check for changes in my bookmarks"
feature is a robot, and I am of the opinion that it should check for
/robots.txt.
By this definition, offline web readers are certainly robots.
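Whichever of these a program turns out to be, the polite first step is
the same: fetch /robots.txt and honor it before retrieving anything
else. A minimal sketch of that check, using Python's standard-library
urllib.robotparser (the robots.txt body, user-agent name, and URLs
below are placeholders, not from this post):

```python
from urllib import robotparser

# Placeholder robots.txt body, as a server might return it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved robot consults this before proceeding further.
print(rp.can_fetch("bookmark-checker", "http://example.com/private/x.html"))  # False
print(rp.can_fetch("bookmark-checker", "http://example.com/index.html"))      # True
```

In a real robot, rp.set_url(...) and rp.read() would fetch the live
/robots.txt instead of parsing a literal string.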
I think, because this definition of a robot is inclusive, there needs
to be some sort of mechanism for recording robot intent. That is,
I potentially would like to prevent part of my site from being hit
by a robot that is an "indexer" but not mind if it is hit by an
"agent". Of course, defining these categories of robots is not easy,
but there are certainly some obvious categories:
indexer     Indexes web pages.
mirror      Makes local copies for presentation via a proxy or an
            offline web viewer.
agent-user  Processes relatively few web pages on behalf of a fairly
            directed user request (e.g., Netscape's bookmark check).
agent-auto  Like agent-user, but usually processing many more pages
            (a research robot trying to study web topology probably
            fits here).
There are probably others, and perhaps 'agent-user' and 'agent-auto'
need to be better defined and split into more categories.
Something like an "Agent-Type:" line could do this, but it requires
speccing out in more detail what Agent-Types exist. (Of course, we
could define a small number and an 'other' category, then define more
types as they become apparent and the need arises.)
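To make the proposal concrete, here is a hypothetical sketch of what an
"Agent-Type:" record might look like and how a robot might read it.
Nothing here is standard robots.txt: the Agent-Type field, the type
names, and the parser below are illustrative assumptions only.

```python
# Hypothetical "Agent-Type:" extension to robots.txt -- not a standard.
SAMPLE = """\
User-agent: *
Agent-Type: indexer
Disallow: /drafts/

User-agent: *
Agent-Type: agent-user
Disallow:
"""

def disallowed_paths(robots_txt, agent_type):
    """Return the Disallow paths applying to robots of agent_type."""
    paths, applies = [], False
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "agent-type":
            # A record applies if its type matches, or is 'other'.
            applies = value in (agent_type, "other")
        elif field == "disallow" and applies and value:
            paths.append(value)
        elif not line.strip():
            applies = False  # blank line ends a record

    return paths

print(disallowed_paths(SAMPLE, "indexer"))     # ['/drafts/']
print(disallowed_paths(SAMPLE, "agent-user"))  # []
```

An indexer would thus stay out of /drafts/ while an agent-user robot
could fetch it, which is exactly the intent-recording behavior argued
for above.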
...Matthew
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html