Re: Info on large scale spidering?

Otis Gospodnetic (otisg@panther.middlebury.edu)
Mon, 20 Jan 1997 22:42:37 -0500 (EST)


On Tue, 21 Jan 1997, Nick Craswell wrote:

> AltaVista claim that Scooter can hit 3 million pages per day. If this
> is the case then, based on the number of pages they index, they should
> be able to update their entire index every week or two. But I've heard
> claims (in this mailing list even) that their index gets months out of
> date. I wonder why?

I think AltaVista exaggerates a little bit.
I had a chance to look at the AltaVista book, and I remember seeing some HUGE
number there. I don't think the number is correct.
If it were, Scooter would have visited the sites I asked it to visit long ago.
And if that number is correct, then Scooter is either one ugly, impolite robot
or a very intelligent one that really knows how to keep track of the sites it
has already seen.
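
(The "keeping track" part is the easy bit, by the way. Here is a minimal
sketch in Python of what I mean - my own illustration, nothing to do with
Scooter's actual internals:

    # Keep a set of canonicalized URLs we've already fetched and skip
    # anything seen before. The canonicalization is my own guess at a
    # minimal version: drop #fragments and lowercase the host.
    from urllib.parse import urldefrag, urlparse

    seen = set()

    def canonicalize(url):
        url, _fragment = urldefrag(url)
        parts = urlparse(url)
        return parts._replace(netloc=parts.netloc.lower()).geturl()

    def should_fetch(url):
        key = canonicalize(url)
        if key in seen:
            return False
        seen.add(key)
        return True

At tens of millions of URLs that set gets big, so a real robot would keep it
on disk or use something more compact, but the idea is the same.)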

> I'm not just interested in AltaVista, though. Maybe having the biggest
> and most up to date index isn't as high a priority for DEC as showing
> off their AlphaServers. Anyway, I'm interested in the problems faced by
> anyone who wants to index tens of millions of web pages (Hotbot, Excite,
> Lycos etc). Many of these don't seem to update their indexes as often
> as I'd expect and I was wondering why.

They may be collecting info at some more or less impressive rate, but they
can't just add new data to the database - they probably have to reindex the
whole thing before it becomes searchable, and I imagine that takes some time :)
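
To make that concrete, here is a toy Python illustration of why indexing is a
separate, slow step (my own sketch, not how any of these engines actually
works): searching needs an inverted index mapping words to pages, and that
structure has to be rebuilt, or at least merged, after new pages come in.

    from collections import defaultdict

    # The crawl produces (url, text) pairs; search needs word -> urls.
    # Rebuilding this mapping is a pass over everything collected, which
    # is why freshly crawled data isn't instantly searchable.
    def build_index(pages):
        index = defaultdict(set)
        for url, text in pages:
            for word in text.lower().split():
                index[word].add(url)
        return index

    pages = [
        ("http://example.com/a", "web robots and spiders"),
        ("http://example.com/b", "large scale web indexing"),
    ]
    index = build_index(pages)
    print(sorted(index["web"]))
    # ['http://example.com/a', 'http://example.com/b']

A 30-million-page version of that pass is obviously not something you rerun
every hour.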

> I was wondering if anyone knew about:
>
> - how hard it is to write a spider which hits, say, 5 million pages per
> day. How about 10 million (which is what Hotbot claim)? Would 50
> million be impossible? (or just really expensive to run)? What are the
> most difficult design problems you face when you want to hit pages that
> fast?

No, it's not impossible at all.
You write a spider, you put it on 10, 15, 20+ machines, and let them loose.
They shouldn't take a lot of CPU cycles, since they don't have much to
compute - the only thing they will really saturate is your T1/T3+ line.
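
Here is a rough sketch of that setup in Python - the machine count, the
hash-partitioning, and the thread count are all my assumptions, just to show
the shape of it:

    import hashlib
    import queue
    import threading
    import urllib.request
    from urllib.parse import urlparse

    NUM_MACHINES = 20   # assumed fleet size
    MACHINE_ID = 3      # assumed identity of this particular box
    NUM_THREADS = 50    # fetching is I/O-bound, so run many threads

    def owns(url):
        # Hash each host to exactly one machine, so no two machines
        # hit the same server and no page is fetched twice.
        host = urlparse(url).netloc
        h = int(hashlib.md5(host.encode()).hexdigest(), 16)
        return h % NUM_MACHINES == MACHINE_ID

    def fetcher(q):
        while True:
            url = q.get()
            try:
                data = urllib.request.urlopen(url, timeout=10).read()
                # ... hand `data` off to storage / the indexer ...
            except Exception:
                pass  # dead links and timeouts are routine at this scale
            finally:
                q.task_done()

    q = queue.Queue()
    for _ in range(NUM_THREADS):
        threading.Thread(target=fetcher, args=(q,), daemon=True).start()

    # Feed it: every machine sees the full URL list, fetches only its share.
    # for url in candidate_urls:
    #     if owns(url):
    #         q.put(url)

The threads spend almost all their time blocked on the network, which is why
the CPUs stay mostly idle and the line becomes the bottleneck.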

> - how expensive is it to set up and to run large scale spiders? What
> kind of hardware do you need? What about network connections? What are
> the ongoing costs? How do these costs increase if you increase your
> spider's speed?

I'd worry about the following:
- Do I have enough disk space for all the data I collect?
  (AltaVista doesn't have that problem :) Terabytes...)
- Is my T1/T3+ line enough?
- Am I going to kill somebody's web servers? Make sure you avoid
  www.imdb.com ;)
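
The arithmetic behind the first two worries fits in a few lines of Python
(the average page size and the rate are assumptions of mine, not anybody's
published numbers):

    # Back-of-envelope: what does 5 million pages/day actually cost?
    pages_per_day = 5_000_000
    avg_page_kb = 10  # assumed average HTML page size

    gb_per_day = pages_per_day * avg_page_kb / 1_000_000
    mbit_per_sec = pages_per_day * avg_page_kb * 8 / 1000 / 86_400

    print(f"disk: ~{gb_per_day:.0f} GB/day")     # ~50 GB/day
    print(f"line: ~{mbit_per_sec:.1f} Mbit/s")   # ~4.6 Mbit/s sustained

So at 5 million pages a day you fill about 50 GB of disk daily and need a
sustained ~4.6 Mbit/s - already three times what a T1 (1.544 Mbit/s) can
carry, though a T3 (~45 Mbit/s) would have headroom for roughly ten times
that rate.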

> - any other limiting factors or problems in large scale web spidering.
I guess what I'm saying is that it's not hard to do, as long as you don't
care what people say about you and your spiders.

I could be wrong, so please correct me if that is so.

Otis
P.S.
if you don't mind me asking - why are you asking this? Do you think you
could collect more data than AltaVista and the others have? Just curious..

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html