Re: web topology

Larry Page (page@CS.Stanford.EDU)
Wed, 26 Jun 1996 20:03:32 -0700


>>Where can I find some detailed facts about over-all web topology?
>>
>>I'm interested in things like connectivity statistics (e.g.,
>>given two pages with _some_ path between them, what's the average
>>shortest length between them)? Stuff like that.

My Ph.D. research involves gathering web topology data to answer questions
such as these. If anyone has interesting things they would like researched
in web topology, let me know, and I might work on it :)
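The connectivity statistic asked about above, the average shortest-path length between pairs of pages that have some path between them, can be computed with a breadth-first search over the link graph. A minimal sketch; the toy graph and function names are illustrative, not from this thread:

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """BFS from `source`; returns hop counts to every reachable page."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for neighbor in graph.get(page, []):
            if neighbor not in dist:
                dist[neighbor] = dist[page] + 1
                queue.append(neighbor)
    return dist

def average_shortest_path(graph):
    """Average hop count over all ordered pairs (u, v), u != v,
    where v is reachable from u."""
    total, pairs = 0, 0
    for source in graph:
        for target, d in shortest_path_lengths(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# A toy link graph: page -> pages it links to.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
```

On a real crawl the graph would be far too large to run all-pairs BFS exhaustively; sampling source pages gives an estimate of the same statistic.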

>>
>>Like several others here, I'm working on a domain-specific (domain of
>>knowledge, not of the Internet) robot. My idea so far is to:
>>
>>1. Create a large set of known-relevant top level pages.
>>
>>2. Index them, retrieve their children and qualify the children for
>>relevance to the domain of interest; index the relevant children.
>
>This raises a big question -- how will you measure the relevancy? Given the
>nature of Web links (I discovered from trying this sort of thing) you have
>to take a very fine-grained approach to evaluating whether or not to follow
>a link. It appeared to me from a brief shot at this kind of robot that
>you'd need to make the recursion decision based on following a series of
>links, not based merely on the contents of individual pages.
>
>The difficult problem is that there are many Web pages that touch a variety
>of subjects -- often the most interesting jumping-off points cover several
>topics. Traditional approaches to relevancy ranking tend not to take this
>into account; the relevancy score would not reflect the usefulness of some
>of the links in a document that might be quite relevant.
>
>In short, it became clear to me that this kind of robot will take some time
>to develop and maintain. It'll also follow many dead ends -- even the best
>relevancy ranking systems today have an accuracy of perhaps 80 percent or
>so. Multiply the 20 percent error to take into account recursion and the
>robot will spin its wheels a bit.
>
>Lots of trial and error is needed, but it's very interesting work, I
>suspect. Verity might have an interest in supporting such research.
>
>Nick
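Nick's compounding-error point can be made concrete: if each relevance judgment is right about 80 percent of the time and the robot recurses through a chain of links, the chance that every decision along the chain was correct falls off geometrically. A small sketch, assuming independent decisions (the 80 percent figure is Nick's estimate above; the rest is illustrative):

```python
def chain_accuracy(per_link_accuracy, depth):
    """Probability that every relevance decision along a chain of
    `depth` links was correct, assuming decisions are independent."""
    return per_link_accuracy ** depth

# With 80% per-link accuracy, fully-relevant chains thin out fast:
for depth in range(1, 6):
    print(f"depth {depth}: {chain_accuracy(0.8, depth):.0%} of chains fully relevant")
```

By three links deep, only about half the chains are fully relevant, which is the wheel-spinning Nick describes.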

I'm a little surprised at all the people developing domain-specific robots.
It seems to me that it might be easier and more efficient to just query a
large existing service like Alta Vista instead of indexing yourself.

I know of a web site that moved away from crawling on its own to gather info
because it was easier, more efficient, and more complete to just send the
right queries to Alta Vista. However, domain-specific robots might be
able to update faster and use more expensive computational heuristics to
determine relevancy than the major search engines. I suppose the legal
issues might be problematic as well if you build your entire site from a
search engine.

-Larry