Re: Back of the envelope computations

Sigfrid Lundberg (siglun@gungner.ub2.lu.se)
Thu, 7 Nov 1996 09:30:01 +0100 (MET)


On Wed, 6 Nov 1996, Francois Rouaix wrote:

> Hi,
> We are contemplating the idea of writing a robot for several academic
> purposes (*not* another full text indexer). Before starting anything,
> I made some "back of the envelope" computations, and I'd like to submit
> them to your criticism.
>
> I started from an estimation of 100Murl in the http realm (we may consider
> not limiting the robot to text/html documents). The robot should be able to
> go around the world in 100 days or less (like 80 ;-), checking each URL.
>
> This requires a coverage of 1Murl per day.

100Murl is presumably the correct order of magnitude. Fair enough.
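
The per-day figure also implies a sustained request rate that is
worth keeping in mind. A quick check in Python (same assumed figures
as the quoted message, not measurements):

    total_urls = 100_000_000        # assumed size of the http realm
    days = 100                      # desired time to visit every URL once
    per_day = total_urls / days     # 1,000,000 URLs/day, as stated above
    per_second = per_day / 86_400   # about 11.6 requests/s, around the clock
    print(per_day, round(per_second, 1))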

>
> The robot would group URLs per site, to attempt to keep some
> consistency in a snapshot of a site; let's call an "agent" the program
> that crawls on a single site.

This one-site <-> one-agent idea is, unfortunately, less than ideal
from my point of view.

> If we limit the speed at which the agent fires rock^H^H^Hequests to
> a given site to 1 per minute (or at least one couple HEAD/GET per minute),

That is what I do.

> this means that it can cover a maximum of 1500 URLs per day.
> So, to get a coverage of 1Murl a day, this means that the robot needs
> about 700 simultaneous agents, say 1000.

Now, with 700-1000 agents, each firing one request per minute, you
are filling your process table with idle processes. A much smaller
number of agents would be needed if you implemented a controlling
process which schedules the accesses to the remote web servers.
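
What I have in mind is roughly the following (a minimal sketch in
Python, with invented names; an illustration of the idea, not our
actual code): a single scheduler keeps one queue of pending URLs per
host, plus the earliest time at which each host may be contacted
again.

    import heapq
    import time
    from collections import defaultdict, deque

    HOST_DELAY = 60.0   # one request per host per minute, as above

    class Scheduler:
        def __init__(self):
            self.pending = defaultdict(deque)  # host -> URLs waiting for that host
            self.ready = []                    # heap of (next_allowed_time, host)

        def add(self, host, url):
            if not self.pending[host]:         # nothing queued yet: make host schedulable
                heapq.heappush(self.ready, (time.time(), host))
            self.pending[host].append(url)

        def next_fetch(self):
            """Return (host, url) for the next request, sleeping if needed
            so no host is contacted more than once per HOST_DELAY."""
            if not self.ready:
                return None
            next_time, host = heapq.heappop(self.ready)
            time.sleep(max(0.0, next_time - time.time()))
            url = self.pending[host].popleft()
            if self.pending[host]:             # more URLs left: eligible again in a minute
                heapq.heappush(self.ready, (time.time() + HOST_DELAY, host))
            return host, url

A handful of fetcher processes pulling from next_fetch() can then
sustain a high overall rate as long as many different hosts have URLs
queued; the rate falls towards one fetch per minute only when few
hosts remain.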

If no measures other than scheduling are taken, it is our experience
that the speed of harvesting will depend on the efficiency of the
controlling process, which in turn depends on the size distribution
of the web servers. And that distribution is skewed. As a rule of
thumb, I would guess that you can assume that 10% of the servers
provide 90% of the documents (and a significant proportion of those
documents are mirrored material, replicas of omnipresent objects).

Therefore, the rate of harvesting (or whatever academic processing
you will be doing) will eventually decrease to one access per minute,
when all the URLs you have left are situated on one single host--the
biggest one on the surface of the planet. In reality this will never
happen, since new URLs appear all the time.

This is without counting all the other clever things one might do:
harvesting interesting/important objects (those with many links
pointing to them, say) at higher priority, giving slower servers
lower priority, and so forth.
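
Just to make the idea concrete, a hypothetical priority score (the
function and its weights are invented for illustration, not anything
we actually run): objects with many inbound links rise, objects on
servers that have been slow to answer sink.

    import math

    def priority(inlinks, avg_response_seconds):
        popularity = math.log(1 + inlinks)             # diminishing returns on link counts
        slowness_penalty = 0.5 * avg_response_seconds  # observed latency, in seconds
        return popularity - slowness_penalty           # higher score => fetch sooner

    # A page with 200 inbound links on a server answering in 2 seconds:
    # priority(200, 2.0) is about 5.3 - 1.0 = 4.3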

Generally speaking, these clever things (and the scheduling policies
themselves) are what confer a competitive edge in the harsh WWW
search index market, and how people go about achieving efficient
harvesting is a trade secret. I've never seen it discussed here,
unfortunately. Things have changed significantly since 1994, when all
of this was an area of academic research.

>
> I'm a bit worried about the constraints that this figure imposes on
> the software or kernel tables, since we planned to use a single
> machine for the robot (say a PPro 200 under Linux). Some thoughts:
> * there will be about 1000 processes/threads. I guess this is not really
> a problem, but if anybody has this kind of experience on Linux, I'll be
> glad to have an opinion.
> * there might be about 1000 simultaneous tcp connections. Can regular
> kernel network code cope with that ?

One member of our team is using a PPro 200 under Linux for harvesting
Norway. His machine is equipped with 256 MB of memory and a lot of
disk (obviously). With 15 copies of our (not very sophisticated)
robots (quite a few of which will be idle most of the time) he
harvests about 6000 pages an hour at top speed (before the efficiency
of the scheduling declines). And this counts only the number of
records written; other database operations are not counted (and there
are a lot of them), so the number of HTTP accesses is in reality
higher still.

>
> Other intriguing figures come up if we consider the robot in "acquisition"
> mode, meaning that it actually downloads all the documents behind the URLs.
> Assuming an average size of 10Kbytes per document, this makes 10Gbytes a day
> of data transfer. So the robot must have a bandwidth (speed of handling all
> the incoming data) of 100Kbytes/s (assuming roughly 100,000 seconds in a day).
> And then, the robot will constantly consume 1Mbit/s of network bandwidth.
>
> Is that the reality for the existing robots that cover the whole web ?

I'd guess that those figures are a bit high, but I have not made any
estimates of my own.
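
The arithmetic itself is internally consistent, at least. Checking it
with the quoted assumptions (1M documents a day, 10 Kbytes each):

    docs_per_day = 1_000_000
    avg_doc_bytes = 10 * 1024
    bytes_per_day = docs_per_day * avg_doc_bytes   # about 10 Gbytes/day
    bytes_per_second = bytes_per_day / 86_400      # about 118,500 bytes/s
    bits_per_second = bytes_per_second * 8         # just under 1 Mbit/s
    print(bytes_per_day, round(bytes_per_second), round(bits_per_second))

Whether the assumptions behind it (full acquisition of every URL, a
10 Kbyte average document) hold in practice is another matter.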

>
>
> --
> Francois.Rouaix@inria.fr Projet Cristal - INRIA Rocquencourt

Sigfrid

________________
Sigfrid Lundberg siglun@munin.ub2.lu.se
Lund university library
Netlab, PO Box 3, S 221 00 Lund +46 222 36 83
Sweden