Re: web topology

Nick Arnett (narnett@verity.com)
Wed, 26 Jun 1996 17:23:05 -0700


>Where can I find some detailed facts about over-all web topology?
>
>I'm interested in things like connectivity statistics (e.g.,
>given two pages with _some_ path between them, what's the average
>shortest length between them)? Stuff like that.
>
>Like several others here, I'm working on a domain-specific (domain of
>knowledge, not of the Internet) robot. My idea so far is to:
>
>1. Create a large set of known-relevant top level pages.
>
>2. Index them, retrieve their children and qualify the children for
>relevance to the domain of interest; index the relevant children.

This raises a big question -- how will you measure relevancy? Given the
nature of Web links (as I discovered from trying this sort of thing), you
have to take a very fine-grained approach to evaluating whether or not to
follow a link. It appeared to me from a brief shot at this kind of robot
that you'd need to make the recursion decision based on a series of links,
not merely on the contents of individual pages.
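
One possible reading of "a series of links" can be sketched as follows: keep
the relevancy scores of the pages along the path that led to the current
page, and continue only while the recent average stays high enough. The
function name, the window of 3 and the 0.5 threshold are all hypothetical
choices for illustration, not something I have tuned.

```python
from typing import Sequence


def should_follow(path_scores: Sequence[float],
                  window: int = 3,
                  threshold: float = 0.5) -> bool:
    """Decide whether to keep recursing along a path of pages.

    path_scores holds a relevancy score in [0, 1] for each page on the
    path so far, most recent last. The window and threshold values are
    illustrative assumptions.
    """
    recent = path_scores[-window:]
    if not recent:
        # Nothing visited yet (a seed page): always follow.
        return True
    # Average over the last few pages, so one weak page on an otherwise
    # relevant path does not end the crawl on its own.
    return sum(recent) / len(recent) >= threshold


# One weak page after two strong ones: keep going.
print(should_follow([0.9, 0.8, 0.1]))
# Three weak pages in a row: give up on this path.
print(should_follow([0.9, 0.1, 0.2, 0.1]))
```

Averaging over a window is only one way to do it; the point is that the
decision looks at more than the page in hand.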

The difficult problem is that there are many Web pages that touch a variety
of subjects -- often the most interesting jumping-off points cover several
topics. Traditional approaches to relevancy ranking tend not to take this
into account; a single score for the whole document does not reflect the
usefulness of individual links in it, some of which might be quite relevant.
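
As a rough illustration of scoring links rather than whole documents, the
sketch below rates each link by how much of its anchor text consists of
domain terms. The function name, the input format and the scoring rule are
assumptions made up for this example; a real robot would need something
considerably smarter.

```python
import re
from typing import Iterable, List, Set, Tuple


def score_links(links: Iterable[Tuple[str, str]],
                domain_terms: Set[str]) -> List[Tuple[str, float]]:
    """Score each (anchor_text, url) pair on its own.

    The score is the fraction of words in the anchor text that appear in
    domain_terms (lowercase). This is a deliberately crude, hypothetical
    measure used only to show link-level scoring.
    """
    scored = []
    for anchor_text, url in links:
        words = re.findall(r"[a-z0-9]+", anchor_text.lower())
        hits = sum(1 for word in words if word in domain_terms)
        score = hits / len(words) if words else 0.0
        scored.append((url, score))
    return scored


# A multi-topic page: only one of its links is about the domain.
page_links = [
    ("Sailing knots guide", "http://example.com/knots"),
    ("Stock quotes", "http://example.com/stocks"),
]
for url, score in score_links(page_links, {"sailing", "knots"}):
    print(url, round(score, 2))
```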

In short, it became clear to me that this kind of robot will take some time
to develop and maintain. It'll also follow many dead ends -- even the best
relevancy ranking systems today have an accuracy of perhaps 80 percent or
so. Compound the 20 percent error over each level of recursion and the
robot will spin its wheels a bit.
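
To put rough numbers on the compounding, here is a small sketch. It assumes
each follow-or-skip decision is right 80 percent of the time independently
of the others, which is a simplification of mine and not a measured result;
the function name is likewise just for illustration.

```python
def chance_all_correct(accuracy: float, depth: int) -> float:
    """Probability that `depth` successive link decisions are all right.

    Assumes the decisions are independent, which is a simplifying
    assumption; real errors along a path are probably correlated.
    """
    return accuracy ** depth


# Print how the chance of an error-free path shrinks with depth.
for depth in range(1, 6):
    print(depth, round(chance_all_correct(0.8, depth), 3))
```

Under that assumption, only about half of the paths three links deep
(0.8 * 0.8 * 0.8 = 0.512) would have been chosen correctly at every step.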

Lots of trial and error is needed, but it's very interesting work, I
suspect. Verity might have an interest in supporting such research.

Nick