Re: The Internet Archive robot

Z Smith (zsmith@archive.org)
Thu, 03 Oct 1996 10:39:23 -0700


Several messages have appeared expressing concern that the Internet Archive
robot might, as a side-effect of its retrieving copies of pages to add to
the archive, unduly load servers.

As some members of the list have been quick to point out, our robot is
(almost) no different from any other search-engine indexing robot, so all
the exclusion rules and load-politeness etiquette that apply to them
apply to us.
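
To make the exclusion-rule point concrete, here is a minimal sketch of how
a robot might check a /robots.txt file before fetching a URL. It uses
Python's standard urllib.robotparser module. The robot name
"example-archiver" and the robots.txt contents are made up for
illustration; they are not the name or rules of any real robot or site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents and robot name, for illustration only.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("example-archiver", "http://example.com/index.html"),
        ("example-archiver", "http://example.com/private/logo.gif"),
        ("some-other-robot", "http://example.com/cgi-bin/query"),
    ]:
        print(agent, url, allowed(agent, url))
```

The same check covers .gif and .jpg URLs as well as .html ones, since the
rules are matched against the URL path and not the object type.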

The obvious difference between a visit from our robot and a visit from one
of the 5 major search engines is that they fetch only the .html files, while
we fetch the .gif and .jpg files as well. (Eventually, other object types
will be archived as well.) The search engine companies usually throw away
their .html files once they've parsed and indexed them, of course, while we
save them.

So far, we are finding that (gif + jpg) space is not all that much larger
than html space (averaged over many sites; your mileage may vary). So, in
terms of load, being visited by our robot is roughly like being visited by
2 of the search engine robots.
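
The arithmetic behind that estimate can be sketched as follows. The byte
counts are invented for illustration, not measurements from the archive;
the only assumption taken from the text above is that image space is
about the same size as html space.

```python
def load_multiple(html_bytes, image_bytes):
    """Bytes fetched by a robot that takes html plus images, expressed as a
    multiple of what an html-only robot would fetch from the same site."""
    if html_bytes <= 0:
        raise ValueError("html_bytes must be positive")
    return (html_bytes + image_bytes) / html_bytes


if __name__ == "__main__":
    # Hypothetical site: 10 MB of html and 10 MB of gif + jpg files.
    print(load_multiple(10_000_000, 10_000_000))
    # Hypothetical site where the images are 20% larger than the html.
    print(load_multiple(10_000_000, 12_000_000))
```

When the two kinds of space are equal the multiple is 2, and it rises
above 2 as the images outgrow the html.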

As to the idea of allowing web-site administrators to specify what time of
day they would prefer to be crawled, so as to minimize visits during heavily
loaded periods: This is perhaps not a bad idea, but it probably isn't
necessary; any crawler designer worth his salt has a built-in incentive to
visit sites when they are lightly loaded---it makes the crawl go quicker.
Once the crawler has retrieved a few pages from a site at various times of
day, a picture of the load pattern begins to emerge, and the
best-time-to-visit for other pages can be scheduled.
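
A minimal sketch of that scheduling idea, not a description of our actual
crawler: record how long each fetch took and at what hour, then prefer the
hour with the lowest average. The class name LoadProfile, its methods, and
the use of response time as a stand-in for server load are all assumptions
made for this example.

```python
from collections import defaultdict


class LoadProfile:
    """Per-site record of fetch response times, bucketed by hour of day."""

    def __init__(self):
        # hour of day (0-23) -> list of observed response times in seconds
        self._samples = defaultdict(list)

    def record(self, hour, seconds):
        """Store one observed response time for the given hour."""
        if not 0 <= hour <= 23:
            raise ValueError("hour must be in 0..23")
        self._samples[hour].append(seconds)

    def mean_by_hour(self):
        """Average response time for each hour that has at least one sample."""
        return {h: sum(v) / len(v) for h, v in self._samples.items()}

    def best_hour(self):
        """Hour with the lowest average response time, or None if no data.

        Ties are broken in favour of the earlier hour.
        """
        means = self.mean_by_hour()
        if not means:
            return None
        return min(means, key=lambda h: (means[h], h))


if __name__ == "__main__":
    profile = LoadProfile()
    # Hypothetical observations: (hour of day, seconds to fetch one page).
    for hour, seconds in [(3, 0.2), (3, 0.4), (14, 1.5), (20, 0.9)]:
        profile.record(hour, seconds)
    print(profile.best_hour())
```

A real crawler would also want to keep sampling the other hours now and
then, since a site's load pattern can change over time.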

Z

At 08:51 AM 9/23/96 -0700, you wrote:
>At 10:53 AM 9/23/96, Jeremy Sigmon wrote:
>
>>For now a few cron jobs could switch between a few robots.txt files during
>>appropriate times of the day.
>
>I don't know how often you expect robots to update their cached version
>of /robots.txt, but often used values are a day or a week. So the above
>won't work...
>
>
>-- Martijn
>
>Email: m.koster@webcrawler.com
>WWW: http://info.webcrawler.com/mak/mak.html
>
>
>
--------------------------------------------------------------------------
Z Smith Phone: 415-561-6799
Internet Archive Facsimile: 415-561-6795
The Presidio, Building 1014, Room 102 E-mail: zsmith@archive.org
(Find our location in real-space: http://www.archive.org/directions.html )
PO Box 29141
San Francisco, California, USA 94129-0141