Re: Client Robot 'Ranjan'

Reinier Post (reinpost@win.tue.nl)
Fri, 21 Jun 1996 14:11:53 +0200 (MET DST)


You (Brenden Portolese) write:

>>>what are the implications of the general public having a Webcrawler?

Hmm, that's impossible; they don't own a machine with 6 GB of internal memory,
over 100 GB of disk space, and supersonic connectivity.

>This is a technical discussion of robots. As robot developers, in your
>opinion, are robots efficient enough, and is the Web developed enough, that
>the presence of a web-crawler on every PC on the Net does not pose a serious
>problem to the Net in general, more specifically with regard to bandwidth,
>and possibly even to the proper functioning of servers themselves?

I suspect the answer contradicts your expectations.

A few years back, we released a built-in search function for XMosaic, which
was by far the most popular WWW browser at the time; only a small fraction of
its users installed our version, and we have no idea how many people actually
used the search function, but it may have been the first widely available
client-based Web roamer for searching purposes. It didn't even have a memory,
let alone an index, so it was ridiculously inefficient. For some time, we
feared that it might stifle the Net with wasted HTTP traffic.
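
To make this concrete, here is a minimal sketch, in present-day Python and
purely for illustration (it is not the XMosaic code), of such a client-side
roamer. The 'visited' set is exactly the kind of memory our version lacked;
without it, every re-encountered link costs another HTTP request.

# Illustrative sketch only: a naive client-side search roamer.
# Without the 'visited' set below, nothing stops it from fetching the
# same pages over and over, which is where the wasted traffic comes from.
import re
from urllib.request import urlopen
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def roam(start_url, query, max_fetches=200):
    frontier = [start_url]
    visited = set()              # the "memory" the original tool lacked
    fetches = 0
    while frontier and fetches < max_fetches:
        url = frontier.pop(0)
        if url in visited:       # skip pages already fetched; dropping this
            continue             # test recreates the original inefficiency
        visited.add(url)
        try:
            page = urlopen(url, timeout=10).read().decode("latin-1", "replace")
        except OSError:
            continue
        fetches += 1
        if query.lower() in page.lower():
            print("match:", url)
        for link in LINK_RE.findall(page):
            frontier.append(urljoin(url, link))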

Nothing of the kind happened. This is partly explained by the fact that
few people ever discovered the function in the first place. But I think
the fundamental reason is that personal Web robots can *never* compete
with dedicated WWW indexing servers elsewhere, no matter how sophisticated
the software is. On a standard PC, it's simply impossible to index more
than a tiny fragment of the Web, and even that will take a few weeks.
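
A back-of-envelope calculation, with assumed numbers rather than anything
measured in this thread, shows the scale problem. Even a crawl that is never
throttled and never revisits a page gets nowhere near reasonable coverage:

# Assumptions, not measurements: one page per second is already optimistic
# for a 1996 PC on a typical connection, and the Web is already counted in
# tens of millions of pages.
pages_per_second = 1.0
crawl_days = 21                               # "a few weeks"
pages_indexed = pages_per_second * 86400 * crawl_days
print(int(pages_indexed))                     # roughly 1.8 million pages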

So the waste of bandwidth that we all fear from personal robots causes them
to perform so poorly that they are hardly ever used. The net result is that
the lost bandwidth is minimal. It's only when herds of people don't realise
the poor performance, or when the robots run *by default*, without the user
even realising what is going on, that we may expect problems. Or when a
personal robot is developed that can outperform a global WWW index for a
certain purpose.

This would only be possible if, for this special purpose, roaming a very small,
well-selected fragment of the Web is sufficient, and if the results would
be very different for different users. For example, the idea of a
subject-specific index of a few pre-selected sites doesn't qualify, as the
service is of interest to *all* users interested in that subject, and is
consequently much more effective when implemented at one of the sites
rather than on the user's machine. Site-wide indexes are common, but they
rarely operate by means of robots, and if they do, the robot is operated by
the site maintainer, so abuse of bandwidth is not an issue. Users find a
site in a global WWW index (or otherwise) and search the site using the site's
own index; users cannot beat that by making their own index. Some articles
suggest that personal 'agents' roaming the Web could be effective when using
AI-based 'personal interest profiles'. This is hogwash; assuming that such
profiles make any sense at all, they are best used against a global WWW index.
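
If anything, such a profile belongs on top of an existing index, not inside a
roaming robot. A small sketch of what I mean, with a placeholder standing in
for the query to a global WWW index (nothing here refers to a real service):

# Hypothetical sketch: apply a 'personal interest profile' (weighted
# keywords) to results obtained from a global index, instead of using it
# to steer a personal robot around the Web.
def search_global_index(query):
    # placeholder for a query to someone else's global WWW index;
    # returns (url, summary) pairs
    return []

def score(text, profile):
    text = text.lower()
    return sum(weight * text.count(term) for term, weight in profile.items())

def personalised_search(query, profile):
    results = search_global_index(query)
    return sorted(results, key=lambda r: score(r[1], profile), reverse=True)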

A client-based robot can still be fairly useful for one thing: very recent
information that global or site-wide WWW indexes haven't had time to find yet.
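
A sketch of what such a robot might look like; the hand-picked URL list and
the use of HTTP conditional GETs are my own illustration, not something
proposed in this thread:

# Hypothetical sketch: a personal robot that only watches a short list of
# hand-picked URLs for recent changes. Conditional GETs (If-Modified-Since)
# mean unchanged pages cost almost no bandwidth.
from urllib.request import Request, urlopen
from urllib.error import HTTPError

last_seen = {}   # url -> Last-Modified header from the previous fetch

def poll(urls):
    for url in urls:
        headers = {}
        if url in last_seen:
            headers["If-Modified-Since"] = last_seen[url]
        try:
            resp = urlopen(Request(url, headers=headers), timeout=10)
        except HTTPError as e:
            # 304 Not Modified: nothing new since the last poll
            if e.code != 304:
                print("error:", url, e.code)
            continue
        except OSError:
            continue
        last_seen[url] = resp.headers.get("Last-Modified", "")
        print("changed:", url)   # hand the fresh page to a local index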

So, in conclusion, past experience appears to indicate that personal
Webcrawlers do not pose a threat to the Web, and that, paradoxically,
this is *owing to* their ineffective use of bandwidth.

-- 
Reinier Post						 reinpost@win.tue.nl
a.k.a. <A HREF="http://www.win.tue.nl/win/cs/is/reinpost/">me</A>