Re: Info on large scale spidering?

Greg Fenton (gregf@mks.com)
Tue, 21 Jan 1997 09:14:51 -0500


Nick Craswell wrote:
>
> AltaVista claim that Scooter can hit 3 million pages per day.
> [...] I wonder why?

Crawling is the easy part. Then there is the indexing of the data to
make it searchable. That takes time...a lot of time...and disk space.
Depending on the structure of their crawl and the technology they use
for indexing the data, they could end up using 3 times the disk space
needed to hold all of the data they crawled.
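
To put rough numbers on that 3x figure (the breakdown below is my own
guess, purely for illustration): you keep the raw pages, an index that
is some large fraction of the raw size, plus scratch space while the
index is being built. Something like:

    # Back-of-the-envelope disk budget (all fractions are assumptions).
    raw_gb         = 150.0   # raw crawled text (see the estimate further down)
    index_fraction = 0.5     # index at roughly 50% of the raw data
    scratch_factor = 1.5     # temp/sort space while building the index

    total_gb = raw_gb * (1 + index_fraction + scratch_factor)
    print("raw %.0f GB -> total %.0f GB (%.1fx)"
          % (raw_gb, total_gb, total_gb / raw_gb))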

Then there are the administration headaches. Not only do they have to
crawl all of this data and then index it, but the disk space used for
these steps has to be made available first to the crawlers, then to
the indexer, then to the search engine. This is non-trivial,
especially when dealing with millions of pages and millions of
searches per day.
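
One way to picture that handoff (a sketch of my own, not how any of
these engines actually does it): each batch of pages lives in a
directory owned by exactly one stage at a time, and a batch gets
promoted from crawl to index to serve as each stage finishes with it.

    import os, shutil

    STAGES = ["crawled", "indexed", "serving"]   # hypothetical layout

    def promote(batch, base="/data"):
        """Move a finished batch to the next stage of the pipeline."""
        for cur, nxt in zip(STAGES, STAGES[1:]):
            src = os.path.join(base, cur, batch)
            if os.path.isdir(src):
                shutil.move(src, os.path.join(base, nxt, batch))
                return nxt
        raise ValueError("batch %r not found or already serving" % batch)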

> [...]Hotbot, Excite, Lycos etc.[...] Many of these don't seem to
> update their indexes as often as I'd expect and I was wondering why.

Another reason is the corporate aspect. These search engines offer a
free service to the public. If that is not the company's primary
business, then keeping the service in top shape will not be one of
the company's main objectives.

I don't know for sure, but I would suspect that many of these search
engines are losing *lots* of money for their companies. I once heard,
from a pseudo-reliable search engine company source, that Alta Vista
had spent upwards of $35 million and had only earned approximately
$1 million in advertising revenue (if that). A company is not going
to throw its top engineers at problems in a system like that.

> - how hard it is to write a spider which hits, say, 5 million pages
> per day. [...]
> What are the most difficult design problems you face when you want
> to hit pages that fast?

I don't think it is all that hard (I have hit 6 million running on
3 big machines). I think it would be quite easy to do if you
used a bunch of small machines (say, 10 Pentiums).
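
For scale, 5 million pages a day is only about 58 fetches a second,
so each of those 10 machines has to sustain roughly 6 pages a second.
Here is a minimal sketch (my own, with made-up names) of splitting
the URL space across the machines by hostname, so any one web server
only ever gets hit from one place:

    import hashlib
    from urllib.parse import urlsplit

    NUM_MACHINES = 10
    PER_MACHINE_PER_SEC = 5_000_000 / 86400 / NUM_MACHINES  # ~5.8 fetches/sec

    def machine_for(url):
        """Assign a URL to a crawler machine by hashing its hostname, so
        all pages from one server are fetched (politely) from one place."""
        host = urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode("ascii", "ignore")).hexdigest()
        return int(digest, 16) % NUM_MACHINES

    # machine_for("http://info.webcrawler.com/robots.html")  -> 0..9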

Like I said, crawling is the easy part. The average size of a text
page, which I calculated over a base of 1.5 million pages, is 7.5 KB
[Can anyone confirm/deny?]. So you crawl 20 million pages; now what do
you do with the 150 GB[!!] of data? You don't just grep it. You have
to build an index, which will take at least another 50% of that space
(I'm not an algorithms person, I'm going by gut feel).
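
To make "build an index" concrete, here is a toy inverted index (a
sketch under heavy simplifying assumptions, nothing like a production
engine): for every word you store the list of documents it appears
in, which is effectively a second copy of much of the information in
the pages themselves. That is where the extra space goes.

    import re
    from collections import defaultdict

    def build_index(docs):
        """docs: dict of doc_id -> page text. Returns word -> sorted doc_ids."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add(doc_id)
        return {word: sorted(ids) for word, ids in index.items()}

    def search(index, word):
        return index.get(word.lower(), [])

    # idx = build_index({1: "crawling is the easy part", 2: "indexing takes time"})
    # search(idx, "indexing")  -> [2]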

> - how expensive is it to set up and to run large scale spiders? What
> kind of hardware do you need? What about network connections?

Sitting on a T3 would help :-). Hardware: a bunch of Pentiums or a
couple of bigger Unix boxes. The hardware limitation will be in the
disk access (memory shouldn't be a problem). The software limitation
will be in the number of TCP/IP connections your OS can handle in the
space of a few minutes.
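
Whatever that OS ceiling turns out to be, the crawler has to cap how
many fetches it has in flight at once. A minimal sketch (assuming a
modern Python thread pool; the numbers are arbitrary):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    MAX_CONNECTIONS = 200   # hard cap on simultaneous TCP connections

    def fetch(url):
        try:
            with urlopen(url, timeout=30) as resp:
                return url, resp.read()
        except OSError as exc:      # URLError, timeouts, refused connections
            return url, exc

    def crawl(urls):
        # The pool size bounds how many sockets are open at any instant.
        with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
            for url, body in pool.map(fetch, urls):
                yield url, body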

> - any other limiting factors or problems in large scale web spidering.

Money == manpower. Writing the code is one thing. Maintaining the
machines and the database is another.

gregf.

-- 
Greg Fenton           | Email: gregf@mks.com       | Opinions 
Professional Services | Phone: +1-519-883-3265     | expressed herein
MKS Inc.              |   Web: http://www.mks.com/ | are mine.
---------------------------------------------------------------------
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html