Re: Search accuracy

Nick Arnett (narnett@Verity.COM)
Thu, 4 Apr 1996 16:04:30 -0800


>Nick, could you please expand this last paragraph. I'd love to hear your
>ideas. Sounds fascinating.

If you assume that Web authors are creating links that have some sort of
conceptual logic behind them, then a robot may be able to infer knowledge
from the choice of documents that are linked, as well as the locations of
the links and the link texts. For example, look at the highly structured
pages of Yahoo, where the text of each link actually describes the pages to
which it is linked. Take "Java": one might guess that if a robot
examined the pages linked from that word, it would often find terms such as
"object-oriented" and "Sun Microsystems," to name a couple. Although your
software might not be able to figure out the nature of the conceptual
connections among them, it could observe the connections and use them as
evidence when someone searches on "Java." That is to say, when someone
searches on "Java," documents that contain "object-oriented" and "Sun
Microsystems" would be ranked as more relevant. This assumes a
completely automated approach. Probably more practical would be to present
the results of the analysis to a human editor, who could tune the
knowledgebase.
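A minimal sketch of the idea in Python (the page texts, the tokenizer, the
stopword list, and the 0.5 boost weight are all invented for illustration,
not a real system):

```python
from collections import Counter

# Hypothetical mini-corpus: anchor text mapped to the pages it links to.
links = {
    "Java": [
        "Java is an object-oriented language from Sun Microsystems.",
        "Sun Microsystems released Java, an object-oriented platform.",
        "Object-oriented design in Java.",
    ],
}

STOPWORDS = {"a", "an", "the", "is", "in", "from", "of", "and"}

def associated_terms(anchor, links, top_n=3):
    """Count content words that recur in the pages linked under an anchor text."""
    counts = Counter()
    for page in links.get(anchor, []):
        for word in page.lower().replace(".", " ").replace(",", " ").split():
            if word not in STOPWORDS and word != anchor.lower():
                counts[word] += 1
    return [term for term, _ in counts.most_common(top_n)]

def score(document, anchor, evidence):
    """Rank a document higher when it also contains the associated terms."""
    text = document.lower()
    base = 1.0 if anchor.lower() in text else 0.0
    boost = sum(0.5 for term in evidence if term in text)
    return base + boost
```

With this corpus, "object-oriented," "sun," and "microsystems" surface as
evidence, so a page mentioning Sun Microsystems outranks one about the
coffee.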

This could help address the big problem with semantic networks and other
sorts of conceptual knowledgebases -- automation of their creation and
maintenance. There have been two general approaches. One is to
automatically extract a semantic network from a dictionary. This works
well within the limits of a dictionary's vocabulary, but many, if not most,
of the interesting words, especially those attached to new information (the
proper noun "Java," for example), aren't in dictionaries. The alternative is to build
your own knowledgebase from the ground up, but that's not easy. The
results can be effective, but few people have the resources to build robust
ones.
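The dictionary-extraction approach, and its limit, can be sketched in a few
lines (the toy entries and stopword list are invented for illustration):

```python
# Toy machine-readable dictionary; real systems used much larger ones.
dictionary = {
    "coffee": "a drink brewed from roasted beans",
    "drink": "a liquid taken into the mouth and swallowed",
}

STOPWORDS = {"a", "an", "the", "from", "into", "and", "of", "for"}

def build_semantic_network(dictionary):
    """Link each headword to the content words of its own definition."""
    return {
        word: {w for w in definition.split() if w not in STOPWORDS}
        for word, definition in dictionary.items()
    }

network = build_semantic_network(dictionary)
network.get("Java")  # None -- no entry, so the network knows nothing about it
```

The last line is the whole problem: the network is only as broad as the
dictionary's headword list.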

Even more interesting to me (and perhaps more practical) is the idea of
using a robot to extract subjective information from the Web. For example,
if you could accurately recognize people's "my favorite links" lists, you
might be able to come up with a pool of opinions rapidly. Then you could
do the kind of analysis that Pattie Maes has been doing at the MIT Media
Lab -- if you like "A" and "B" and I like "A", then my agent will bring "B"
to my attention.
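That matching rule is simple enough to sketch directly (the overlap
threshold and the list data are assumptions, not Maes's actual method):

```python
def recommend(my_favorites, other_lists, min_overlap=1):
    """If another person's favorites overlap with mine by at least
    min_overlap items, their remaining favorites become suggestions."""
    mine = set(my_favorites)
    suggestions = set()
    for their_list in other_lists:
        theirs = set(their_list)
        if len(mine & theirs) >= min_overlap:
            suggestions |= theirs - mine
    return suggestions

# You like "A" and "B"; I like "A"; my agent brings "B" to my attention.
recommend(["A"], [["A", "B"], ["C", "D"]])  # {"B"}
```

A real system would weight suggestions by how much two people's lists
overlap, rather than using a fixed threshold.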

Of course, as one of our engineers observed, we might discover that the
most talked-about concept on the Web is "click here" or perhaps "under
construction." ;-)

Nick