Back of the envelope computations

Francois Rouaix (Francois.Rouaix@inria.fr)
Wed, 06 Nov 1996 17:35:46 +0100


Hi,
We are contemplating writing a robot for several academic purposes
(*not* another full-text indexer). Before starting anything, I made some
"back of the envelope" computations, and I'd like to submit them for
your criticism.

I started from an estimate of 100M URLs in the http realm (we may consider
not limiting the robot to text/html documents). The robot should be able to
go around the world in 100 days or less (like 80 ;-), checking each URL.

This requires covering 1M URLs per day.

The robot would group URLs per site, to try to keep a snapshot of a site
consistent; let's call the program that crawls a single site an "agent".
If we limit the rate at which an agent fires rock^H^H^Hequests at a given
site to 1 per minute (or rather one HEAD/GET pair per minute), it can cover
a maximum of 1440 URLs per day, say 1500.
So, to cover 1M URLs a day, the robot needs about 700 simultaneous agents,
say 1000.
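
For the record, here is the same arithmetic spelled out as a tiny OCaml
program (every figure is one of the rough estimates above, nothing is
measured):

    (* Rough crawl-rate arithmetic; all inputs are estimates from the text. *)
    let () =
      let total_urls   = 100_000_000 in        (* estimated size of the http realm *)
      let days         = 100 in                (* target length of one full pass *)
      let urls_per_day = total_urls / days in  (* = 1_000_000 *)
      let per_agent    = 24 * 60 in            (* one HEAD/GET pair per minute = 1440/day *)
      let agents       = urls_per_day / per_agent in   (* = 694, i.e. about 700 *)
      Printf.printf "URLs/day: %d  agents: %d\n" urls_per_day agents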

I'm a bit worried about the constraints that this figure imposes on the
software or on kernel tables, since we plan to use a single machine for
the robot (say a PPro 200 under Linux). Some thoughts:
* there will be about 1000 processes/threads. I guess this is not really
a problem, but if anybody has this kind of experience on Linux, I'd be
glad to hear an opinion.
* there might be about 1000 simultaneous TCP connections. Can the regular
kernel network code cope with that? (A small probe is sketched below.)
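
On the second point, here is a minimal OCaml probe one could run on the
target machine; it assumes (my assumption, to be checked) that the
per-process file descriptor limit is the first wall, and simply opens TCP
sockets until the kernel refuses:

    (* Not a robot design, just a probe: count how many sockets one process
       can hold before the kernel returns an error (typically EMFILE). *)
    let () =
      let socks = ref [] in
      (try
         while true do
           socks := Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 :: !socks
         done
       with Unix.Unix_error (err, _, _) ->
         Printf.printf "stopped after %d sockets: %s\n"
           (List.length !socks) (Unix.error_message err));
      List.iter Unix.close !socks

If it stops well below 1000 (default per-process limits often do), the
limit and the corresponding kernel tables would have to be raised before
1000 simultaneous connections are realistic.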

Other intriguing figures come up if we consider the robot in "acquisition"
mode, meaning that it actually downloads all the documents behind the URLs.
Assuming an average size of 10Kbytes per document, this makes 10Gbytes of
data transfer a day. So the robot must handle the incoming data at a
sustained 100Kbytes/s (rounding a day to 100,000 seconds), which means it
will constantly consume roughly 1Mbit/s of network bandwidth.
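
Spelled out the same way (same assumptions: one million documents a day,
10Kbytes per document, a day rounded to 100,000 seconds):

    (* Bandwidth estimate in acquisition mode; all inputs are estimates. *)
    let () =
      let docs_per_day    = 1_000_000 in
      let kbytes_per_doc  = 10 in              (* assumed average document size *)
      let kbytes_per_day  = docs_per_day * kbytes_per_doc in   (* 10 Gbytes/day *)
      let seconds_per_day = 100_000 in         (* a day, rounded up from 86_400 *)
      let kbytes_per_sec  = kbytes_per_day / seconds_per_day in   (* = 100 *)
      Printf.printf "%d Kbytes/s = %d Kbit/s, roughly 1 Mbit/s\n"
        kbytes_per_sec (kbytes_per_sec * 8)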

Is that the reality for the existing robots that cover the whole web?

--
Francois.Rouaix@inria.fr                   Projet Cristal - INRIA Rocquencourt