Re: defining "robot"

Martijn Koster (m.koster@webcrawler.com)
Sat, 23 Nov 1996 08:54:26 -0800


At 12:39 AM 11/22/96, Matthew K Gray wrote:

>A "robot" is any piece of software which automatically retrieves web
>pages/documents* and does not immediately present them via some
>mechanism for human consumption before proceeding further.
>
>* I define here "web pages/documents" as a file and any content it
>inlines (this includes images, java applets, or other embedded
>content, but not other documents it points to).

Interesting enough, and maybe as good as any. However...

>By this definition, a proxy server is not a robot, because it
>retrieves only documents which will be immediately returned to a
>client, which generally will be presenting it to a person. In the
>case of a robot going through a proxy, it should clearly request
>robots.txt. In any case, the proxy does not 'proceed further' without
>direction. (On the other hand, a proxy which periodically mirrors
>popular sites would be a robot)

Well, that's where I'm not completely convinced. Say I have a caching
proxy, used only by humans. Say the proxy refreshes often-retrieved pages
every hour, to ensure they're up to date. It may even be that the cache
makes fewer retrievals than direct users would have generated.
The pages were originally requested by a human, and continue to be
requested by a human. The refreshing is done without human intervention,
though, and so under your definition it qualifies as a robot.
Yet it's not the typical "blind link-following" robot that people have
in mind, and I would have difficulty calling that a robot.

>By this definition, Navigator's "Check for changes in my bookmarks"
>feature is a robot, and I am of the opinion that it should check for
>/robots.txt.

Again, these pages were retrieved by a human the first time. Now the same
user just repeats the same retrievals in a slightly less RSI-inducing
fashion. So what makes these cases so different that server admins
would want to administer them differently?
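Granted, honoring /robots.txt costs a client very little either way. A
minimal sketch in modern Python (the stdlib urllib.robotparser module is
today's convenience, not something 1996 clients had; the agent name and
rules below are made-up examples):

```python
# Sketch: how any automated retriever could consult robots.txt before
# fetching. The rules and the user-agent name are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)  # in practice these lines come from http://host/robots.txt

# A bookmark-checker deciding whether a refresh is allowed:
print(rp.can_fetch("BookmarkWatcher", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("BookmarkWatcher", "http://example.com/index.html"))         # True
```

Whether a bookmark-checker *should* do this is exactly the question at hand;
the mechanism itself is trivial.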

To me, a single page-watcher is not a robot, whereas a metacrawler
that parses search-engine results pages and (repeatedly) validates the
links found there is a robot.

These cases are not far-fetched...

To me what makes a robot a robot is that it retrieves a page, greps
for (non-inline) links, and wants to retrieve _those_. So it's an
initial retrieval, followed by non-human-controlled further retrievals.
And for the hecklers among us: no, I don't even pretend that that is
a complete, unambiguous, and correct definition.
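The distinction drawn here can be sketched in a few lines of modern Python
(purely for illustration; the class and function names are my own): a robot
is software that takes a retrieved page, pulls out the non-inline links,
and queues *those* for further retrieval, while inlined content (images
and the like) belongs to the page itself.

```python
# Sketch of the "initial retrieval, then non-human-controlled further
# retrievals" pattern: separate followable links from inlined content.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect <a href> links (followable) and <img src> (inline content)."""
    def __init__(self):
        super().__init__()
        self.links = []    # non-inline links a robot would retrieve next
        self.inline = []   # inlined content that is part of the page itself

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.inline.append(attrs["src"])

def robot_frontier(page_html):
    """Return the URLs a link-following robot would want to retrieve next."""
    parser = LinkExtractor()
    parser.feed(page_html)
    return parser.links
```

On this sketch, a page containing `<a href="next.html">` and
`<img src="logo.gif">` yields a frontier of just `["next.html"]`: the image
is part of the page, not a further retrieval.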

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html