Re: Is a robot visiting?

Tim Freeman (tim@infoscreen.com)
Fri, 1 Nov 1996 11:52:10 -0800


>Why would you want to send different files anyway?

When a person visits, we redirect to a URL with an embedded random
unique ID. (I know this would be better done with cookies, but not
everyone has cookies.) It would be a shame if a robot came, got a
"unique" ID, and then gave everyone who used its index the same
"unique" ID.
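
For concreteness, a minimal CGI-style sketch of that redirect; the
"id" query parameter and the random-ID scheme here are just my
illustration, not a description of our actual setup:

    import binascii, os

    def redirect_with_unique_id(base_url):
        # Generate a random "unique" ID and embed it in the URL,
        # since we can't rely on every visitor having cookies.
        session_id = binascii.hexlify(os.urandom(8)).decode("ascii")
        # A CGI script redirects by emitting a Location header.
        print("Status: 302 Found")
        print("Location: %s?id=%s" % (base_url, session_id))
        print()  # blank line terminates the CGI header block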

>The HTTP_USER_AGENT field may provide some clues, such as including words
>like "robot" or "crawl" or "engine" but then some companies may call their
>browsers WebCrawler for some reason. Also in order to get the Netscape versions
>of dynamically created documents some crawlers send a Netscape
>like HTTP_USER_AGENT (something like "Mozilla(compatible-supercrawl)/3.0")

Looking at the robot list, I see these user agents:

BlackWidow, BackRub/*.*, root/0.1, Deweb/1.01, Hamahakki/0.2

So I don't think searching for keywords in HTTP_USER_AGENT would
work, and I don't want to have to keep updating a robot list on a
deployed system.
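
To illustrate (a quick test of my own, not part of the robot list):

    agents = ["BlackWidow", "BackRub/*.*", "root/0.1",
              "Deweb/1.01", "Hamahakki/0.2"]
    keywords = ["robot", "crawl", "engine"]
    for agent in agents:
        hits = [k for k in keywords if k in agent.lower()]
        print(agent, "->", hits or "no keyword match")

None of those agents matches any of the suggested keywords.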

>- guess from Accept headers if the client is likely to be a browser
> (very few search engines ask for image/* or audio/*)

An excellent idea. Thank you.

Given a choice among heuristics, I prefer simpler ones. So here is
my current plan: if I find even one measurably popular browser that
sends HTTP_FROM, use this test:

if it accepts image/*, it's a browser;
elsif it sends HTTP_FROM, it's a robot;
else it's a browser.

Otherwise, simply go by the HTTP_FROM field: if the request has one,
it's a robot.
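
A sketch of the whole heuristic as a CGI check, assuming the headers
arrive in the usual CGI environment variables HTTP_ACCEPT and
HTTP_FROM; the POPULAR_BROWSER_SENDS_FROM flag is just my name for
the condition above:

    import os

    # My assumption: flip this to True if a measurably popular
    # browser is ever observed sending HTTP_FROM; until then the
    # simple HTTP_FROM test below is enough.
    POPULAR_BROWSER_SENDS_FROM = False

    def is_robot(environ=os.environ):
        # Guess whether the current CGI request comes from a robot.
        accept = environ.get("HTTP_ACCEPT", "")
        http_from = environ.get("HTTP_FROM")
        if POPULAR_BROWSER_SENDS_FROM:
            if "image/" in accept:
                return False   # asks for images: almost surely a browser
            elif http_from:
                return True    # identifies its operator: a robot
            else:
                return False   # default to browser
        else:
            # Simple test: go by HTTP_FROM alone.
            return http_from is not None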

Tim Freeman