Re: Info on large scale spidering?

Nick Craswell (Nick.Craswell@anu.edu.au)
Thu, 23 Jan 1997 09:21:16 +1100


Thanks for all the replies to my initial message. Here are some thoughts:

Otis Gospodnetic wrote:
> I think AltaVista lies about it a little bit.

Yeah, me too. Maybe their 3 million pages/day figure is based on "optimal
network conditions".

> they may be collecting info at some more or less impressive rate, but they
> can't just add new data to the database - they probably have to reindex the
> database before it becomes searchable, and I imagine that takes some time :)

They claim that they can crunch text at 1GB per hour, so if they are
indexing 150GB of text, a full rebuild takes about 150 hours, or roughly
six days. That means they can rebuild their index weekly, so that's not
a problem.

On a side note, I can't see why you couldn't spider monthly and rebuild
your index weekly (i.e. update 1/4 of it per week). There's no such thing
as a snapshot of the web anyway. If you know a page has been added,
changed or deleted, then why not reflect this in your index as soon as
possible?
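
Just to make that 1/4-per-week idea concrete, here's a rough sketch (in
Python) of what I have in mind: hash each URL into one of four partitions
and re-index one partition per week. The hashing scheme and the
rebuild_partition() callback are purely my own assumptions for
illustration; nobody has said the real engines work this way.

    import hashlib

    NUM_PARTITIONS = 4   # rebuild one partition per week -> full cycle per month

    def partition_of(url):
        """Assign a URL to a stable partition by hashing it."""
        return hashlib.md5(url.encode()).digest()[0] % NUM_PARTITIONS

    def weekly_rebuild(all_urls, week_number, rebuild_partition):
        """Re-index only the partition whose turn it is this week."""
        target = week_number % NUM_PARTITIONS
        batch = [u for u in all_urls if partition_of(u) == target]
        rebuild_partition(target, batch)   # caller supplies the actual indexer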

> P.S.
> if you don't mind me asking - why are you asking this ? Do you think you
> could collect more data than AltaVista and others did ? Just curious..

I'm a grad student looking at research topics. One of the things I am
interested in is how we will search the web in the future, given its
enormous growth. One of the ideas I have been considering is that of
"Distributed Information Retrieval", where instead of a large
centralised index you have lots of smaller ones distributed around the
net.

If I am going to talk about a new way of indexing the web, though, I
first need to understand the old way (even though there isn't much info
available).

Joe St Sauver wrote:
> The number of pages you can hit is partially a function of how you treat
> non-200 return codes, and how "polite" you are.

When people mention politeness, do they mean the amount of time between
individual hits on a site? If so, is the answer to do a lot of site
parallelism? I.e. if your spider is considered rude for hitting one site
100 times over some period, why not hit 100 sites once each and repeat?
(There's a rough sketch of this idea a bit further down.)

Or is politeness the number of hits on a site in a day or in a week? If
it is the number of hits per week that determines your politeness, then
a spider which traverses your whole site every week can't possibly be
considered polite! (If your site is big enough, that is.)

Still, if you decide on some maximum "polite" spidering hit rate, then
you may hit some sites weekly and other sites monthly. What doesn't
make sense is saying, "therefore we will rebuild our index every
month". If you can reflect new developments on www.netscape.com a week
later rather than a month later then why wouldn't you want to?
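
For what it's worth, here's a rough sketch (Python) of the "site
parallelism" idea above: never hit the same host more than once per some
polite interval, but keep lots of hosts in rotation so the spider as a
whole stays busy. The 60-second interval and the fetch() callback are
assumptions on my part, not anyone's published policy.

    import time
    from collections import deque
    from urllib.parse import urlparse

    POLITE_DELAY = 60.0   # assumed: at most one hit per host per minute

    def polite_crawl(urls, fetch):
        """Round-robin over hosts so no single site sees rapid-fire requests."""
        queues = {}     # host -> pending URLs for that host
        last_hit = {}   # host -> time of the most recent request
        for u in urls:
            queues.setdefault(urlparse(u).netloc, deque()).append(u)

        while any(queues.values()):
            for host, q in queues.items():
                if q and time.time() - last_hit.get(host, 0.0) >= POLITE_DELAY:
                    fetch(q.popleft())      # caller-supplied fetch function
                    last_hit[host] = time.time()
            time.sleep(0.1)                 # don't spin flat out while waiting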

> For fifty million pages a day, you'd better plan on OC3 (155Mb/sec) into
> one of the major exchange points.

Yeah, I thought 50 million might be pushing it. :-)

Greg Fenton wrote:
> Crawling is the easy part. Then there is the indexing of the data to
> make it searchable. This takes time...a lot of time...and disk space.

Yeah, you need the hardware to crunch data as fast as it comes in. I
guess it helps if you manufacture high-end machines yourself (i.e. DEC);
it's even more expensive otherwise.

You also need to spend lots of $$ on disk space. However, the amount of
disk space you need depends on how big your index is, not on how fast
your spider goes. So for a 50 million page index, you need x GB of disk
space whether you spider weekly or every 3 months. So disk space doesn't
limit your spider's speed.

> Another reason is the corporate aspect. These search engines offer a
> free service to the public. If this is not the company's primary
> business, then keeping the service in top shape is not a primary
> objective of the company.

This is a very convincing reason! They could spider faster if they
wanted to, but they just don't care. A lot of these problems could be
solved with clever programming and lots of hardware bucks, but the big
boys don't care enough about index updates to put in the necessary cash.

And why should they? It's a free service, and even if advertising were
covering costs, why would they be willing to put up extra $$$ just to
run their spider n times faster?

> > What are the most difficult design problems you face when you want
> > to hit pages that fast?
>
> I don't think it is all that hard (I have hit 6 million running on
> 3 big machines).

Thanks for the info.

> Money == man power. Writing the code is one thing. Maintaining the
> machines and database is another.

I agree with this economic explanation of things.

Nick Arnett wrote:
> With HTML, the index overhead is more like 25-30 percent. Of course, 150 GB
> is still a big corpus.

People have mentioned needing lots of disk space. Does that mean for
150GB of text you need 200+GB of disk space (150GB of web pages + a
50+GB index)? Although I suppose if you were going to keep everything
you pulled back on disk, you could at least compress it... Even so,
isn't that some sort of breach of copyright? I guess not, as long as you
think of it as a very big proxy cache.

The reason you keep the actual pages on disk is so that, when indexing,
if a page hasn't changed since you last hit it you can read it straight
off disk. Right? So spiders are always sending conditional GETs, and if
there's no change then going to their local copy of the page.
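
Something like this, I imagine (a minimal Python sketch; the cache
layout -- a stored body plus the time of the last fetch -- is my own
assumption):

    import urllib.request
    import urllib.error
    from email.utils import formatdate

    def fetch_if_modified(url, cached_body, last_fetch_time):
        """Return a fresh copy on 200, or the cached copy if the server says 304."""
        req = urllib.request.Request(url)
        req.add_header("If-Modified-Since",
                       formatdate(last_fetch_time, usegmt=True))
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()           # page changed: index the new copy
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return cached_body           # unchanged: read straight off disk
            raise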

> We probably shouldn't expect the search service companies to reveal a great
> deal about their tools, since they're not selling them.

Yeah, good point. And they're in competition too!

Thanks again for all the info. If anyone has any further clues then
please let me know. I'll probably be looking at this stuff for a few
months.

-- 
Nick Craswell                     ph: 249 4001 (w)
Department of Computer Science  Mail: Nick.Craswell@anu.edu.au
Australian National University   Web: http://pastime.anu.edu.au/nick/
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html