Re: Back of the envelope computations

John D. Pritchard (jdp@cs.columbia.edu)
Wed, 06 Nov 1996 12:48:25 -0500


bonjour francois,

> I started from an estimation of 100Murl in the http realm (we may consider
> not limiting the robot to text/html documents). The robot should be able to
> go around the world in 100 days or less (like 80 ;-), checking each URL.
>
> This requires a coverage of 1Murl per day.

max URL length is 1024 bytes each
=> max 1Gb of URL data per day
=> max 100Gb of URL data over the project
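
just to keep the arithmetic honest, here is a quick python sketch of those
back-of-the-envelope numbers (the constants are the ones from your mail,
nothing else is assumed):

    # sizing the crawl from the figures above
    TOTAL_URLS    = 100_000_000   # ~100M URLs in the http realm
    DAYS          = 100           # go around the world in 100 days
    MAX_URL_BYTES = 1024          # assumed maximum URL length

    urls_per_day  = TOTAL_URLS // DAYS              # 1,000,000 URLs/day
    bytes_per_day = urls_per_day * MAX_URL_BYTES    # ~1 Gb of URL text per day
    total_bytes   = bytes_per_day * DAYS            # ~100 Gb over the project

    print(urls_per_day, bytes_per_day, total_bytes)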

> The robot would group URLs per site, to attempt to keep some
> consistency in a snapshot of a site; let's call an "agent" the program
> that crawls on a single site.
> If we limit the speed at which the agent fires requests to
> a given site to 1 per minute (or at least one HEAD/GET pair per minute),
> this means that it can cover a maximum of 1500 URLs per day.
> So, to get a coverage of 1Murl a day, this means that the robot needs
> about 700 simultaneous agents, say 1000.
>
> I'm a bit worried about the constraints that this figure imposes on
> the software or kernel tables, since we planned to use a single
> machine for the robot (say a PPro 200 under Linux). Some thoughts:
> * there will be about 1000 processes/threads. I guess this is not really
> a problem, but if anybody has this kind of experience on Linux, I'd be
> glad to hear an opinion.

won't work. i don't know Linux, but i do know that lots of threads slow
things down, so something like 100 threads is more realistic.
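
if you do try one box, a bounded pool is the shape i'd aim for: ~100 worker
threads, each acting as one "agent" that crawls a single site with the
one-request-per-minute pause you propose. a rough python sketch (the queue
and the fetch stub are mine, purely illustrative):

    import queue, threading, time

    NUM_WORKERS     = 100   # about as many threads as i'd trust
    REQUEST_PAUSE_S = 60    # one request per minute per site, as proposed

    site_queue = queue.Queue()   # each item: a list of URLs for one site

    def fetch(url):
        # stub -- the real agent would issue the HEAD/GET pair here
        pass

    def agent():
        # one agent handles one site at a time, politely throttled
        while True:
            urls = site_queue.get()
            for url in urls:
                fetch(url)
                time.sleep(REQUEST_PAUSE_S)
            site_queue.task_done()

    for _ in range(NUM_WORKERS):
        threading.Thread(target=agent, daemon=True).start()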

> * there might be about 1000 simultaneous tcp connections. Can regular
> kernel network code cope with that?

solaris can handle 1024, i don't know about Linux, but the threading this
implies is the real restriction anyway.
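
whatever the OS, you can at least ask the kernel how many descriptors (and
hence tcp connections) one process gets before committing to a design. a
small python sketch using the standard resource module:

    import resource

    # current per-process file descriptor limits
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("fd limit: soft=%d hard=%d" % (soft, hard))

    # try to raise the soft limit to the hard limit before opening sockets
    # (may require privileges, and may still be far below 1000 connections)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))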

> Other intriguing figures come up if we consider the robot in "acquisition"
> mode, meaning that it actually downloads all the documents behind the URLs.
> Assuming an average size of 10Kbytes per document, this makes 10Gbytes a day
> of data transfer. So the robot must have a bandwidth (speed of handling all
> the incoming data) of 100Kbytes/s (assuming roughly 100,000 seconds in a day).
> And then, the robot will constantly consume 1Mbit/s of network bandwidth.
>
> Is that the reality for the existing robots that cover the whole web ?

100Kbytes per second of network bandwidth is very high, about 800kbps --
you'll flood your department's subnet. you need to put this on your
institution's backbone, or tap it into one of your primary routers.
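
for the record, the conversion is just arithmetic on the figures above
(taking Kbytes as 1000 bytes):

    DOCS_PER_DAY    = 1_000_000      # 1M URLs fetched per day
    AVG_DOC_BYTES   = 10 * 1000      # 10 Kbytes average document
    SECONDS_PER_DAY = 86_400

    bytes_per_day = DOCS_PER_DAY * AVG_DOC_BYTES      # 10 Gbytes/day
    bytes_per_sec = bytes_per_day / SECONDS_PER_DAY   # ~116 Kbytes/s
    bits_per_sec  = bytes_per_sec * 8                 # ~0.9 Mbit/s

    print("%.0f Kbytes/s, %.2f Mbit/s"
          % (bytes_per_sec / 1000.0, bits_per_sec / 1e6))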

anyway, you need more than one machine to get even close to these goals,
more like 10 or 20. you can do this with cheap pentiums, e.g. 75MHz. since
the net is your bottleneck, you don't need high-speed machines; you need
lots of them.

i don't know about Linux on the various intels, but i would suggest looking
into the threading capabilities of Linux on the cheapest intel boxes you can
get through your channels. if you can find 10 or 20 386es, i would
seriously consider using them.

then a cluster of cheap machines could nfs-mount a server with a 101Gb disk
array.

then your software just has to make sensible use of these 10 or 20
sets of collection files.
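
the split across boxes can be dead simple: hash the hostname so a given
site always lands on the same machine and the same collection file. a
sketch of the idea (the crc32 hash and the file layout are just my
assumptions):

    import zlib
    from urllib.parse import urlparse

    NUM_MACHINES = 20   # or 10 -- however many cheap boxes you end up with

    def machine_for(url):
        # the same host always maps to the same machine, so each site is
        # crawled (and rate-limited) in exactly one place
        host = urlparse(url).hostname or ""
        return zlib.crc32(host.encode()) % NUM_MACHINES

    def collection_file(url):
        # e.g. machine N appends its results to collection file N on the
        # nfs-mounted disk array
        return "/nfs/crawl/collection.%02d" % machine_for(url)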

let me know how it works out.

-john