Re: Info on large scale spidering?

Nick Arnett (narnett@Verity.COM)
Tue, 21 Jan 1997 09:05:58 -0800


At 09:14 AM 1/21/97 -0500, Greg Fenton wrote:

>Crawling is the easy part. Then there is the indexing of the data to
>make it searchable. This takes time...a lot of time...

I'd be interested in knowing what sort of throughput is actually possible on
a T-1 if a spider "owns" it. I doubt that it's anywhere near high enough to
generate more text than a high-end workstation can index in real time.
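
For a rough upper bound, here is a back-of-the-envelope sketch. It assumes
the nominal T-1 line rate of 1.544 Mbit/s and ignores framing, TCP/IP and
HTTP overhead, server latency and politeness delays, so real throughput
would be lower. The function names are mine, purely for illustration.

```python
# Upper-bound throughput for a spider that has a T-1 to itself.
# Assumption: nominal line rate of 1.544 Mbit/s; usable payload is lower.
T1_BITS_PER_SECOND = 1_544_000
SECONDS_PER_DAY = 86_400


def max_bytes_per_second(bits_per_second=T1_BITS_PER_SECOND):
    """Best-case bytes per second if every bit carried page data."""
    return bits_per_second / 8


def max_gigabytes_per_day(bits_per_second=T1_BITS_PER_SECOND):
    """Best-case decimal gigabytes fetched in 24 hours of saturation."""
    return max_bytes_per_second(bits_per_second) * SECONDS_PER_DAY / 1e9


def days_to_fetch(corpus_gigabytes, bits_per_second=T1_BITS_PER_SECOND):
    """Best-case days needed to pull a corpus of the given size."""
    return corpus_gigabytes / max_gigabytes_per_day(bits_per_second)


if __name__ == "__main__":
    print(f"{max_bytes_per_second():,.0f} bytes/s")
    print(f"{max_gigabytes_per_day():.1f} GB/day")
    print(f"{days_to_fetch(150):.1f} days for a 150 GB corpus")
```

Under those assumptions the ceiling is a bit under 200 KB per second, or
roughly 17 GB per day, and a saturated line would need about nine days to
pull 150 GB. Whether a given workstation indexes faster than that is
something to measure rather than assume.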

>Then there are the administration headaches. Not only do they have to
>crawl all of this data, then index it, but the disk space used for
>these steps has to be made available first to the crawlers, then
>the indexer, then the search engine. This is non-trivial, especially
>when dealing with millions of pages and millions of searches per day.

Excuse me, but this does not have to be true. Crawling, indexing and search
can happen simultaneously, along with hot backups, on the same index.
Still, dividing up the work among machines and processes does raise complex
administrative issues.
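
To illustrate the general idea only, here is a toy sketch in which a
crawler thread, an indexer thread and queries all share one live index.
This is not how any particular product does it: the class and function
names are hypothetical, the "crawl" just walks an in-memory dict, and a
single lock stands in for whatever concurrency control a real engine uses.

```python
import queue
import threading
from collections import defaultdict


class LiveIndex:
    """Toy in-memory inverted index that can be searched while it grows."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._lock = threading.Lock()

    def add(self, doc_id, text):
        # Hold the lock while updating so readers never see a half-added doc.
        with self._lock:
            for term in text.lower().split():
                self._postings[term].add(doc_id)

    def search(self, term):
        # Return a sorted copy so callers are isolated from later updates.
        with self._lock:
            return sorted(self._postings.get(term.lower(), ()))


def crawl(pages, outbox):
    """Stand-in for a spider: hand each page to the indexer as it arrives."""
    for doc_id, text in pages.items():
        outbox.put((doc_id, text))
    outbox.put(None)  # sentinel: nothing more to fetch


def index_worker(inbox, index):
    """Index pages as they come off the queue until the sentinel appears."""
    while True:
        item = inbox.get()
        if item is None:
            break
        index.add(*item)


def run_pipeline(pages):
    index = LiveIndex()
    fetched = queue.Queue()
    workers = [
        threading.Thread(target=crawl, args=(pages, fetched)),
        threading.Thread(target=index_worker, args=(fetched, index)),
    ]
    for worker in workers:
        worker.start()
    # A query issued mid-run is safe; it may simply see a partial index.
    index.search("web")
    for worker in workers:
        worker.join()
    return index


if __name__ == "__main__":
    idx = run_pipeline({"a": "Spider crawls web", "b": "index the web"})
    print(idx.search("web"))
```

The point of the sketch is only that nothing forces the three stages to
take turns with the disk; the hard part in practice is partitioning the
work across machines.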

>So you crawl 20 million pages, now what do
>you do with the 150 GB[!!] of data? You don't just grep it. You have
>to build an index, which will take at least another 50% of that space
>(I'm not an algorithms person, I'm going by gut feel).

With HTML, the index overhead is more like 25-30 percent. Of course, 150 GB
is still a big corpus.
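
To put numbers on the difference, here is a small sketch of the
arithmetic. The only inputs are the figures already in this thread (150 GB
of HTML, 25-30 percent overhead versus the 50 percent guess); the function
name is mine.

```python
def index_size_gb(corpus_gb, overhead_percent):
    """Extra disk space for the index, as a percentage of the corpus size."""
    return corpus_gb * overhead_percent / 100


if __name__ == "__main__":
    corpus_gb = 150
    for percent in (25, 30, 50):
        extra = index_size_gb(corpus_gb, percent)
        print(f"{percent}% overhead: {extra} GB index, "
              f"{corpus_gb + extra} GB total")
```

That works out to 37.5-45 GB of index rather than 75 GB, so the total
footprint is closer to 190-195 GB than 225 GB.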

>Sitting on a T3 would help :-). Hardware: a bunch of pentiums or a
>couple of bigger Unix boxes. The hardware limitation will be in the
>disk access (memory shouldn't be a problem). The software limitation
>will be in the number of TCP/IP connections that your OS can support
>in a few minutes.

Disk I/O is typically the bottleneck for indexing and low-volume searching,
but the bottleneck eventually becomes the CPU as more simultaneous users are
added.

We probably shouldn't expect the search service companies to reveal a great
deal about their tools, since they're not selling them.

Nick

---------------------------------------
Verity Inc.
Connecting People with Information

Product Manager, Categorization and Visualization
408-542-2164; home office 408-369-1233; fax 408-541-1600
http://www.verity.com

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html