I'm interested in things like connectivity statistics (e.g.,
given two pages with _some_ path between them, what's the average
shortest path length between them?). Stuff like that.
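For concreteness, here's the kind of number I mean, as a brute-force
BFS over a toy link graph (Python; the dict-of-lists representation
and the function name are just for illustration):

    # Toy brute-force version of the statistic: mean shortest-path
    # length over all ordered pairs of pages with _some_ path.
    from collections import deque

    def avg_shortest_path(graph):
        """graph: dict mapping each page to the pages it links to."""
        total, pairs = 0, 0
        for source in graph:
            # BFS from `source` gives shortest link distances to
            # everything reachable from it.
            dist = {source: 0}
            queue = deque([source])
            while queue:
                page = queue.popleft()
                for nbr in graph.get(page, []):
                    if nbr not in dist:
                        dist[nbr] = dist[page] + 1
                        queue.append(nbr)
            for target, d in dist.items():
                if target != source:
                    total += d
                    pairs += 1
        return total / pairs if pairs else float("nan")

    # avg_shortest_path({"a": ["b"], "b": ["c"], "c": []}) == 4/3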
Like several others here, I'm working on a domain-specific (domain of
knowledge, not of the Internet) robot. My idea so far is to:
1. Create a large set of known-relevant top-level pages.
2. Index them, retrieve their children, and qualify the children for
relevance to the domain of interest; index the relevant children.
3. Recurse on the relevant results from step 2; recurse only
to a limited depth on the irrelevant results from step 2.
For example, allow up to 1 or 2 (or whatever) irrelevant docs
before discontinuing recursion on that sub-graph; see the
sketch after this list.
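Here's a rough sketch of steps 1-3, just to pin down the cutoff.
fetch_links() and is_relevant() are hypothetical stand-ins for the
fetcher and the domain classifier, and counting _consecutive_
irrelevant docs (resetting on a relevant hit) is only one possible
reading of step 3; counting total irrelevant docs along the path
would be a one-line change.

    from collections import deque

    def crawl(seeds, max_irrelevant=2):
        """Breadth-first crawl that tolerates up to `max_irrelevant`
        consecutive irrelevant docs on a path before pruning it."""
        seen = set(seeds)
        # Queue entries: (url, irrelevant docs seen in a row so far)
        queue = deque((url, 0) for url in seeds)
        indexed = []

        while queue:
            url, run = queue.popleft()
            if is_relevant(url):             # hypothetical classifier
                indexed.append(url)          # index the relevant doc
                run = 0                      # reset the irrelevance run
            else:
                run += 1
                if run > max_irrelevant:
                    continue                 # prune this sub-graph
            for child in fetch_links(url):   # hypothetical link extractor
                if child not in seen:
                    seen.add(child)
                    queue.append((child, run))
        return indexed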
This should work if a reasonably large proportion of all relevant
docs can be reached via no more than n irrelevant ones. Of course
there will be misses, but with some real connectivity data I could
choose the cutoff n more intelligently.
Any suggestions much appreciated.
-- Fred