When a person visits, we redirect to a URL with an embedded
random unique ID. (I know that this would be better done with
cookies, but not everyone has cookies.) It would be a shame if a
robot came, got a "unique" ID, and then gave everyone who used the
index the same "unique" ID.
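
For what it's worth, the redirect itself is simple. Here's a minimal
sketch of the idea (Python just to be concrete; the script path and
names are made up, not the deployed code):

    import secrets

    def redirect_with_session_id():
        # Make an ID that is hard to guess and unlikely to collide.
        session_id = secrets.token_hex(16)
        # Embed the ID in the URL so later requests carry it back
        # without needing cookies.
        print("Status: 302 Found")
        print("Location: /cgi-bin/index.cgi/%s/" % session_id)
        print()  # blank line ends the CGI headers

    if __name__ == "__main__":
        redirect_with_session_id()
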
>The HTTP_USER_AGENT field may provide some clues, such as including words
>like "robot" or "crawl" or "engine" but then some companies may call their
>browsers WebCrawler for some reason. Also in order to get the Netscape versions
>of dynamically created documents some crawlers send a Netscape
>like HTTP_USER_AGENT (something like "Mozilla(compatible-supercrawl)/3.0")
Looking at the robot list, I see these user agents:
BlackWidow, BackRub/*.*, root/0.1, Deweb/1.01, Hamahakki/0.2
Therefore I don't think searching for keywords in the USER_AGENT would
work. I don't want to try to update the robot list on a deployed
system.
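
Just to make that concrete: a naive keyword test built from the words
suggested above misses every one of those user agents, and would also
flag a browser that happened to call itself WebCrawler. (Python,
illustration only, not a proposed implementation.)

    ROBOT_KEYWORDS = ("robot", "crawl", "engine")

    def looks_like_robot(user_agent):
        ua = user_agent.lower()
        return any(word in ua for word in ROBOT_KEYWORDS)

    for ua in ("BlackWidow", "BackRub/*.*", "root/0.1",
               "Deweb/1.01", "Hamahakki/0.2"):
        print(ua, looks_like_robot(ua))  # prints False for every one
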
>- guess from Accept headers if the client is likely to be a browser
> (very few search engines ask for image/* or audio/*)
An excellent idea. Thank you.
Given a choice among heuristics, I prefer simpler ones. So here's my
plan at the moment: if I find a browser of measurably positive
popularity that sends HTTP_FROM, use this test:

  if it accepts image/*, it's a browser;
  elsif it has an HTTP_FROM, it's a robot;
  else it's a browser.

Otherwise I'll simply go by the HTTP_FROM field.
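
Here's roughly how that first test would look inside the CGI script
(Python just to be concrete; the function name is mine, and the
environment variables are the standard CGI ones):

    import os

    def is_robot(environ=os.environ):
        # Heuristic from above: a client asking for image/* is treated
        # as a browser; failing that, a client that sends an HTTP From:
        # header is treated as a robot; anything else is assumed to be
        # a browser.
        accept = environ.get("HTTP_ACCEPT", "")
        if "image/" in accept:
            return False  # asks for images, so call it a browser
        if environ.get("HTTP_FROM"):
            return True   # no image/* but sends From:, so call it a robot
        return False      # when in doubt, call it a browser
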
Tim Freeman