Re: Is a robot visiting?

Klaus Johannes Rusch (e8726057@student.tuwien.ac.at)
Fri, 1 Nov 1996 12:16:14 CET


In <199611010557.VAA00282@infoscreen.com>, Tim Freeman <tim@infoscreen.com> writes:
> At http://info.webcrawler.com/mak/projects/robots/guidelines.html it says:
>
> Identify yourself
>
> HTTP supports a From field to identify the user who runs the WWW
> browser. Use this to advertise your email address
> e.g. "j.smith@somehwere.edu". This will allow server maintainers to
> contact you in case of problems, so that you can start a dialogue on
> better terms than if you were hard to track down
>
> ...
> Are there any browsers out there that pass something for HTTP_FROM?

The From: header is a general-purpose header for identifying the client, be
it a browser or a human. Not all browsers send it, but some do, so you should
not conclude from the presence of a From: header that the client is
necessarily a crawler.

The HTTP_USER_AGENT field may provide some clues, such as words like
"robot", "crawl" or "engine", but then some companies may call their
browsers WebCrawler for some reason. Also, in order to get the Netscape
versions of dynamically created documents, some crawlers send a
Netscape-like HTTP_USER_AGENT (something like
"Mozilla(compatible-supercrawl)/3.0").
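
As a rough illustration of the keyword check, here is a minimal Python sketch.
The three words come from the paragraph above; the function name and the
pattern are my own assumptions, not anything a server provides. Note that it
flags a browser called WebCrawler just as readily as a real crawler.

```python
import re

# Keywords taken from the message; the regular expression itself is an
# illustrative assumption, not an established convention.
ROBOT_WORDS = re.compile(r"robot|crawl|engine", re.IGNORECASE)


def user_agent_hints_robot(user_agent):
    """Return True if the User-Agent string contains a robot-like word.

    This is only a hint: a missing header yields False, and a browser
    whose name happens to contain one of the words yields True.
    """
    return bool(ROBOT_WORDS.search(user_agent or ""))
```

In a CGI script the argument would typically be the value of the
HTTP_USER_AGENT environment variable.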

If you don't need 100% accuracy you could:
- watch the list of active robots, and send those the robot version
- check for special terms in the HTTP_USER_AGENT and perhaps also the
  HTTP_FROM fields
- track accesses to your server and assume a visitor starting with the
  robots.txt file is probably a robot (watch out for crawlers running
  from multiple hosts, such as crawlxxx.supercrawl.com)
- guess from the Accept headers whether the client is likely to be a browser
  (very few search engines ask for image/* or audio/*)
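
The last two heuristics in the list above could be sketched roughly as
follows. All names here (accept_suggests_browser, FirstRequestTracker) are
hypothetical, and the tracker keys on the exact host name, so it does not
handle the multiple-host case mentioned above.

```python
def accept_suggests_browser(accept_header):
    """Return True if the Accept header lists an image/* or audio/* type.

    Parameters after ';' (such as q-values) are ignored. A bare "*/*"
    does not count, since it says nothing about images or audio.
    """
    parts = (accept_header or "").split(",")
    types = [p.split(";")[0].strip().lower() for p in parts]
    return any(t.startswith(("image/", "audio/")) for t in types)


class FirstRequestTracker:
    """Remember the first path each host requested (in memory only)."""

    def __init__(self):
        self._first_path = {}

    def record(self, host, path):
        # setdefault keeps the first path seen and ignores later ones.
        self._first_path.setdefault(host, path)

    def started_with_robots_txt(self, host):
        return self._first_path.get(host) == "/robots.txt"
```

Both results are guesses to be combined with the other checks, not proof
either way.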

One more thing: some crawlers are actually proxy servers with an indexing
engine, so which version would you send those clients? There may be a human
interested in your robots.txt file, but at the same time the engine will pick
up the file.

Why would you want to send different files anyway?

Klaus Johannes Rusch

--
e8726057@student.tuwien.ac.at, KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/