Re: agents ignoring robots.txt

Erik Selberg (selberg@cs.washington.edu)
16 Oct 1996 14:49:26 -0700


> WebCrawler/2.0 and Scooter/1.0 made single appearances whereas
> MetaCrawler/1.2b made hundreds from various sites, but mostly
> from the cs.washington.edu domain.

MetaCrawler Caveat (here's where I pass the buck... :)
MetaCrawler obtains documents from various sources, such as Lycos,
WebCrawler, AltaVista, etc. There is an option for the user to verify
those results, and the requests you're seeing are users exercising
that option. Theoretically, MetaCrawler shouldn't need to worry
about robots.txt, because theoretically none of the services
MetaCrawler uses return things that were supposed to be excluded under
robots.txt. Theoretically.
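
For the curious, verification is conceptually nothing fancier than
re-fetching each hit and throwing away the ones that no longer respond.
A rough sketch in Python, with made-up names and none of the real error
handling (this is not our actual code):

    import urllib.request
    from urllib.error import URLError

    def verify_results(urls, timeout=10):
        # Keep only the hits that still respond; drop dead links.
        # Rough sketch only -- hypothetical, not MetaCrawler's code.
        live = []
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        live.append(url)
            except (URLError, OSError):
                pass  # dead link, timeout, or unreachable host
        return live

The real thing does more than this, but that's the basic shape of the
traffic you're seeing.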

What I do tend to see is that:
1) Services have a lot of material already in their databases that is
now supposed to be excluded by robots.txt (i.e. the robots.txt file
appeared after the robot visited, and the service hasn't deleted the
data).

2) Not all major services follow the robots.txt guidelines; some are
worse than others (my guess is that some have trouble with wildcard
handling; see the sketch after this list for what the standard
actually asks for).
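
For reference, the exclusion standard itself doesn't ask for much: a
record's User-agent value is either "*" or a name matched as a
case-insensitive substring, and each Disallow value is a plain path
prefix, not a pattern. A minimal sketch of a compliant check in Python
(simplified; it glosses over some record-grouping details and is not
anybody's production parser):

    def parse_robots(text):
        # Parse robots.txt into {user-agent token: [disallowed prefixes]}.
        # Simplified sketch of the 1994 standard: '#' starts a comment,
        # a blank line ends a record, Disallow values are plain prefixes.
        rules = {}
        agents = []
        for raw in text.splitlines():
            line = raw.split('#', 1)[0].strip()
            if not line:
                agents = []
                continue
            field, _, value = line.partition(':')
            field, value = field.strip().lower(), value.strip()
            if field == 'user-agent':
                agents.append(value.lower())
                rules.setdefault(value.lower(), [])
            elif field == 'disallow' and agents:
                for agent in agents:
                    if value:            # empty Disallow = allow everything
                        rules[agent].append(value)
        return rules

    def allowed(rules, agent, path):
        # A record naming the agent specifically beats the '*' record.
        agent = agent.lower()
        for token, prefixes in rules.items():
            if token != '*' and token in agent:
                return not any(path.startswith(p) for p in prefixes)
        if '*' in rules:
            return not any(path.startswith(p) for p in rules['*'])
        return True

    rules = parse_robots("User-agent: *\nDisallow: /private/\n")
    allowed(rules, "MetaCrawler/1.2b", "/private/index.html")   # False

Note that the only "wildcard" in the standard is the "*" user-agent
record; Disallow values are prefixes, so there is nothing to expand.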

End result is that MetaCrawler gets gobs of references it shouldn't,
and when the user asks for verification, it goes and fetches them.
Great feature for users, somewhat annoying for services. I'm also not
intending to put robots.txt support into MetaCrawler, as the
performance implications are prohibitive (verification in real time is
bad enough; grabbing and parsing each site's robots.txt on top of that
would make things much worse). I will admit that MetaCrawler should
make better use of caching so it doesn't retrieve the same documents
over and over again, and we are working on that.
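
To give an idea of what I mean by better caching: something as simple
as a URL-keyed cache with a time-to-live would stop most of the repeat
fetches. A toy sketch (Python, hypothetical; not the code we actually
run):

    import time
    import urllib.request

    class FetchCache:
        # URL -> document cache with a time-to-live, so verifying the
        # same hit twice in a short window doesn't go back to the
        # origin server. Hypothetical sketch only.
        def __init__(self, ttl_seconds=3600):
            self.ttl = ttl_seconds
            self.entries = {}              # url -> (fetched_at, body)

        def fetch(self, url):
            now = time.time()
            hit = self.entries.get(url)
            if hit and now - hit[0] < self.ttl:
                return hit[1]              # still fresh: serve from cache
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
            self.entries[url] = (now, body)
            return body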

I think this is just another example of why robots.txt isn't a
scalable solution. Rob --- do I recall correctly that Apache will add
a "deny from user-agent" option in .htaccess files? That seems a much
better way of achieving the same goals than modifications to
robots.txt, and it also solves the problem of robots.txt being created
AFTER the robot has already visited.
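
Just to illustrate the idea (this is not Apache syntax, and I'm not
claiming it's what Rob has in mind): server-side enforcement means the
server itself refuses requests whose User-Agent matches a deny list,
instead of hoping the robot reads and honors robots.txt. A toy Python
server makes the point:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Illustrative only: the "deny from user-agent" idea as a toy HTTP
    # server rather than an Apache directive. The deny list is made up.
    DENIED_AGENTS = ('MetaCrawler', 'SomeBadRobot')

    class DenyByAgent(BaseHTTPRequestHandler):
        def do_GET(self):
            agent = self.headers.get('User-Agent', '')
            if any(bad in agent for bad in DENIED_AGENTS):
                self.send_error(403, 'Robots not welcome')
                return
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain')
            self.end_headers()
            self.wfile.write(b'Hello, human.\n')

    if __name__ == '__main__':
        HTTPServer(('', 8000), DenyByAgent).serve_forever()

Since the enforcement lives with the server, it also covers documents
that were public long before anyone thought to write a robots.txt.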

-Erik

-- 
				Erik Selberg
"I get by with a little help	selberg@cs.washington.edu
 from my friends."		http://www.cs.washington.edu/homes/selberg