Re: Search accuracy

Benjamin Franz (snowhare@netimages.com)
Mon, 1 Apr 1996 15:13:30 -0800 (PST)


On Mon, 1 Apr 1996, Nick Arnett wrote:

> >On Fri, 29 Mar 1996, Darrin Chandler wrote:
>
> >The individual words usually cough up very different sub-sets of pages
> >related to rabbits. A *good* search request would look for all of them - in
> >the absence of such searches, I would keyword a page to all of them. And I
> >would be correct to do so. But your unique words rejection heuristic would
> >likely deny the page.
>
> You're saying that a search on "buns" should return pages about rabbits? ;-)

Actually, yes. ;-) Those who frequently talk about rabbits use 'buns' as a
synonym for rabbits. And a search on buns on Alta Vista *does* return pages
involving rabbits, along with discussions of food, hair, long-distance
running in cold weather, and both human and non-human anatomy. This is
where skill in constructing a search to exclude things that are *not* of
interest comes in handy.

> The nit I'd like to pick here is that you're describing good recall
> (finding all of the relevant documents), which is only half of the search
> accuracy problem. The other half is precision, which is finding only
> relevant documents. A thesaurus/dictionary-based semantic network could
> return all of the documents that you describe... but the problem would
> remain that it would *also* return many, many other documents that have
> words with some sort of linguistic connection to these.

Yup.
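
To put numbers on the two halves described above, here is a minimal
sketch of how precision and recall could be computed for a single query.
The page ids and the function name `precision_recall` are made up for
illustration; they do not come from any real search engine.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids of the pages the engine gave back
    relevant: ids of the pages the searcher actually wanted
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: what fraction of the returned pages were wanted.
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: what fraction of the wanted pages were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result of searching on "buns" while wanting rabbit pages.
returned = ["rabbit-care", "rabbit-breeds", "hot-cross-buns",
            "hair-buns", "winter-running"]
relevant = ["rabbit-care", "rabbit-breeds", "rabbit-diet"]

precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Broadening the query (say, through a thesaurus) can pull in
"rabbit-diet" and raise recall, but every unrelated page that comes
along with it lowers precision.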

> Balancing precision and recall is the big problem in search. Robots that
> compile additional evidence can help in ways that go beyond just indexing
> the words. For example, capturing HTML zone information can help score
> documents based on where words appear.

The general problem is that while as an author I can tell the search engines
that a list of words is relevant to the topic of my page, it is incumbent on
the *searcher* to exclude irrelevant topics - because as an author I have no
way to know which topics a given searcher will consider irrelevant. If the
search engines even *allowed* specifying a list of irrelevant but
potentially searched keywords, it would help. So when someone searched
'buns AND pictures' I could rank my pages *lower*. But even that marginal
assistance is not available with the current search engines.
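
As a rough sketch of what such author-supplied hints might look like,
assume each page carries two keyword lists: one for relevant terms and
one for terms the author wants to be ranked *lower* on. The field names
(`keywords`, `not_relevant_for`), the `score` function and the `penalty`
value are all hypothetical; no current search engine offers this.

```python
def score(page, query_terms, penalty=0.5):
    """Score a page against a query.

    One point per query term found in the page's keywords, then the
    total is multiplied by `penalty` once for every query term the
    author listed as not relevant.
    """
    terms = {t.lower() for t in query_terms}
    matches = len(terms & page["keywords"])
    if matches == 0:
        return 0.0
    conflicts = len(terms & page["not_relevant_for"])
    return matches * (penalty ** conflicts)


# Hypothetical pages: a text-only rabbit page whose author does not
# want it ranked highly for picture searches, and a page of pictures.
rabbit_page = {
    "keywords": {"buns", "bunnies", "rabbits"},
    "not_relevant_for": {"pictures"},
}
picture_page = {
    "keywords": {"buns", "pictures"},
    "not_relevant_for": set(),
}

plain = score(rabbit_page, ["buns"])
demoted = score(rabbit_page, ["buns", "pictures"])
other = score(picture_page, ["buns", "pictures"])
print(plain, demoted, other)
```

The rabbit page still matches 'buns AND pictures', but it drops below
the page that actually wants that query.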

Parsing the HTML structure simply will not (cannot) resolve the search
problem of 'buns'. On *each* kind of page listed in my example, 'buns' is in
fact a highly relevant element of the page - but only a sub-set of those
pages are relevant to *me*, as I am only interested in one *kind* of buns
(Well, ok, I'm interested in some of the others. But they still are not
relevant to my search for information on rabbits).
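
A small sketch of the point, assuming a toy zone-weighting scheme in
which a word counts for more in the title or a heading than in body
text. The weights, the `ZoneScorer` class and both sample pages are
invented for illustration and do not describe how any particular robot
works.

```python
import re
from html.parser import HTMLParser

# Assumed weights: words in <title> or <h1> count more than body text.
ZONE_WEIGHTS = {"title": 3.0, "h1": 2.0}
BODY_WEIGHT = 1.0


class ZoneScorer(HTMLParser):
    """Sum weighted occurrences of one term across HTML zones."""

    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.zones = []      # stack of currently open weighted tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in ZONE_WEIGHTS:
            self.zones.append(tag)

    def handle_endtag(self, tag):
        if self.zones and self.zones[-1] == tag:
            self.zones.pop()

    def handle_data(self, data):
        weight = ZONE_WEIGHTS[self.zones[-1]] if self.zones else BODY_WEIGHT
        words = re.findall(r"[a-z']+", data.lower())
        self.score += weight * words.count(self.term)


def zone_score(html, term):
    scorer = ZoneScorer(term)
    scorer.feed(html)
    scorer.close()
    return scorer.score


rabbit_page = ("<html><head><title>Buns at play</title></head><body>"
               "<h1>Our buns</h1><p>Two buns share a hutch.</p></body></html>")
bakery_page = ("<html><head><title>Buns fresh daily</title></head><body>"
               "<h1>Sweet buns</h1><p>Warm buns from the oven.</p></body></html>")

print(zone_score(rabbit_page, "buns"), zone_score(bakery_page, "buns"))
```

Both pages come out with the same score for 'buns', because the word
sits in equally prominent zones on each. The structure says how
important the word is to the page, not which kind of buns is meant.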

-- 
Benjamin Franz