Info on large-scale spidering?

Nick Craswell (Nick.Craswell@anu.edu.au)
Tue, 21 Jan 1997 12:37:31 +1100


AltaVista claim that Scooter can hit 3 million pages per day. If that
is the case then, given the number of pages they index, they should be
able to refresh their entire index every week or two. But I've heard
claims (on this mailing list, even) that their index gets months out
of date. I wonder why?
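
A quick back-of-envelope check of that reasoning, assuming an index of
roughly 30 million pages (my own round number for illustration, not a
figure from DEC):

    # Refresh-time estimate; the index size is an assumption, not a
    # published AltaVista number.
    index_size_pages = 30_000_000      # assumed size of the index
    crawl_rate_per_day = 3_000_000     # Scooter's claimed rate
    days_to_refresh = index_size_pages / crawl_rate_per_day
    print(days_to_refresh, "days for a complete pass")  # -> 10.0 days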

I'm not just interested in AltaVista, though. Maybe having the biggest
and most up-to-date index isn't as high a priority for DEC as showing
off their AlphaServers. Anyway, I'm interested in the problems faced by
anyone who wants to index tens of millions of web pages (Hotbot, Excite,
Lycos, etc.). Many of these don't seem to update their indexes as often
as I'd expect, and I was wondering why.

In particular, does anyone know about:

- how hard it is to write a spider that hits, say, 5 million pages per
day. How about 10 million (which is what Hotbot claim)? Would 50
million be impossible, or just really expensive to run? What are the
most difficult design problems you face when you want to hit pages
that fast? (See the rough rate arithmetic after this list.)

- how expensive it is to set up and run a large-scale spider. What
kind of hardware do you need? What about network connections? What
are the ongoing costs? And how do those costs grow as you increase
the spider's speed?

- any other limiting factors or problems in large-scale web spidering.
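
To make those daily totals concrete, here is the simple division
behind them (the only assumption is a crawler running flat out, around
the clock):

    # Sustained fetch rates implied by the claimed daily totals.
    for pages_per_day in (5_000_000, 10_000_000, 50_000_000):
        per_second = pages_per_day / 86_400   # seconds in a day
        print(f"{pages_per_day:>10,} pages/day ~ {per_second:,.0f} fetches/sec, sustained")
    # 5 million/day is roughly 58 fetches every second, around the
    # clock; 50 million/day is roughly 579/sec, before retries, DNS
    # lookups and politeness delays are counted.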

thanks for any help,
Nick.

claims about spider speed:
http://altavista.digital.com/cgi-bin/query?pg=tmpl&v=about.html&what=news
http://www.hotbot.com/FAQ/faq-overview.html

-- 
Nick Craswell                     ph: 249 4001 (w)
Department of Computer Science  Mail: Nick.Craswell@anu.edu.au
Australian National University   Web: http://pastime.anu.edu.au/nick/
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html