Re: word spam

Nick Arnett (narnett@Verity.COM)
Mon, 22 Apr 1996 19:46:54 -0700


> What started this thread is the observation that smarter indexing
>could result in better query results. Search engines that understand the
>difference between a text word, a title word and an HTML tag will invariably
>return better results for simple queries than one that doesn't. Do you
>disagree with this conclusion?

Our engine does this. Take a look at this:

http://www.verity.com/vlibsearch.html

To search for a word in the title and the body, but weight the results
higher if it's in the title, you'd use a query like this:

[.7](robot <in> title),robot

The implicit weight of a term is .5 -- try different weights for the title
term and you'll see different results. This query looks for the word robot
in the document and the title; among documents where the density of "robot"
in the text is equal, those that have "robot" in the title will be ranked
higher. It stems, too, so you'll also get documents about robotics, for
example.
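
To make the ranking effect concrete, here is a toy sketch in Python. It is
not the engine's scoring: the function names, the prefix match standing in
for stemming, and the choice to simply add the weighted title hit to the
weighted term density are all assumptions made for illustration. Only the
two weights, .7 and .5, come from the query above.

```python
import re

def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(token, term):
    # Crude stand-in for stemming (an assumption): a prefix match,
    # so "robot" also hits "robotics".
    return token.startswith(term)

def density(text, term):
    """Fraction of tokens in the text that match the term."""
    toks = tokens(text)
    if not toks:
        return 0.0
    return sum(matches(t, term) for t in toks) / len(toks)

def score(doc, term, title_weight=0.7, term_weight=0.5):
    """Hypothetical score: weighted title hit plus weighted density.

    The density is taken over the whole document (title and body),
    mirroring the unqualified "robot" part of the query.
    """
    in_title = any(matches(t, term) for t in tokens(doc["title"]))
    whole_doc = doc["title"] + " " + doc["body"]
    return title_weight * in_title + term_weight * density(whole_doc, term)

def rank(docs, term, title_weight=0.7):
    """Return the documents sorted from highest to lowest score."""
    return sorted(docs, key=lambda d: score(d, term, title_weight),
                  reverse=True)
```

Given two documents with the same density of "robot", rank() puts the one
with "robot" in its title first; passing title_weight=0 removes that
preference and leaves the documents in their original order.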

However, I think you'll discover, if you use this quite a bit, that
although it is useful, it isn't as great as you might imagine. For one
thing, your queries can become quite complex if you want to search on a few
terms. On the other hand, the work I'm doing with JavaScript might make it
much easier to set weights in titles and such.

You can also use the "<in>" syntax to search other HTML zones -- HEAD,
BODY, H1, etc. Even wildcards -- "robot <in> h*" will find it in any
heading.
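
A rough sketch of what a zone search with a wildcard might look like, in
Python using the standard html.parser and fnmatch modules. The class and
function names are hypothetical, the substring test is a crude stand-in for
real term matching, and nothing here reflects how the engine actually
indexes zones. One caveat of this sketch: a plain glob such as "h*" would
also match zones named head or html if the page has them; how the engine
treats that case is not covered in this message.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as zones.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class ZoneCollector(HTMLParser):
    """Collect the text found inside each HTML element ("zone")."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.zones = {}       # element name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for tag in set(self.open_tags):
            self.zones.setdefault(tag, []).append(data)

def term_in_zone(html, term, zone_pattern):
    """Return True if term occurs in any zone whose name matches the glob."""
    collector = ZoneCollector()
    collector.feed(html)
    collector.close()
    term = term.lower()
    pattern = zone_pattern.lower()
    return any(
        fnmatchcase(zone, pattern) and term in " ".join(chunks).lower()
        for zone, chunks in collector.zones.items()
    )
```

For a page with "Robot exclusion" in an H2 and nothing about robots in the
title, term_in_zone(page, "robot", "h*") is true while
term_in_zone(page, "robot", "title") is false.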

So... better results, yes, in general, maybe. Weighting title words higher
implies putting less emphasis on the document contents, which means you'll
decrease recall in some cases (when titles aren't informative). Even if it
works, will people use it? Is the difference significant? Will the
behavior be unexpected and confuse people who assume they're doing a plain
search of text? Should the default behavior be to automatically weight the
title and headings higher? Maybe. The real question is how much
difference this makes. I'd be curious to hear, especially for real-world
search problems.

By the way, that URL points to a collection of Web-related documents.

Nick

P.S. I'm glad to see Tim and others pointing out that the differences in
generic search accuracy among the top engines are relatively small.