Re: Search accuracy

Ellen M Voorhees (ellen@scr.siemens.com)
Fri, 5 Apr 1996 07:59:11 -0500 (EST)


> >All in all, I prefer approaches like PLS' where the document is
> >subjected to a statistical analysis, one where each word is
> >indexed as well as its relationship with all the other words in
> >the document. Along with the standard arsenal of boolean,
> >fielded, and adjacency search features, I feel this represents a
> >very reasonable "middle-ground" for the typical searcher.
>
> I think you're saying two things at once here -- statistical analysis
> helps, but a variety of algorithms/operators is important. This would
> seem to be quite true; as was said here earlier, the more evidence, the
> better. Statistical analysis (aside from its indexing speed and size issues)
> is done on a corpus, not individual documents. This presents the problem
> of combining search results from multiple corpuses. That's not an issue
> until you try to leverage search across a bunch of indexes whose corpuses
> have different co-occurring word frequencies. We find that customers don't
> generally turn on our statistical operators when they're available. Do you
> get better search results with co-occurring word ("concept") search
> turned on?

In many retrieval systems (e.g., SMART, INQUERY), the functions used
to weight terms include both within-document factors (the number of times the
term occurs in the document) and corpus-wide factors (the number of documents
in which the term occurs). These systems get much better results
with weighting schemes that include both factors than with schemes
that lack either one.
What are TOPIC's statistical operators?
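
As a rough sketch of what I mean, here is a toy tf*idf-style weighting in
Python. It only illustrates combining a within-document factor with a
corpus-wide factor; it is not SMART's or INQUERY's actual formula, and the
tokenization is deliberately simplistic:

    import math
    from collections import Counter

    def tfidf_weights(documents):
        """Toy tf*idf weighting: within-document term frequency (tf)
        combined with a corpus-wide inverse document frequency (idf).
        Illustrative sketch only, not any particular system's scheme."""
        n_docs = len(documents)
        tokenized = [doc.lower().split() for doc in documents]
        # Corpus-wide factor: number of documents containing each term.
        df = Counter()
        for tokens in tokenized:
            df.update(set(tokens))
        weights = []
        for tokens in tokenized:
            tf = Counter(tokens)  # within-document factor
            weights.append({term: tf[term] * math.log(n_docs / df[term])
                            for term in tf})
        return weights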

The combination of search results from multiple corpora is receiving
attention in the text retrieval research community. The TREC
(Text REtrieval Conference) workshop series sponsored by NIST has a
track devoted to the topic (the Database Merging Track) that I lead.
There are also a couple of papers on the topic in the SIGIR-95 proceedings:
one paper by Jamie Callan and his colleagues at UMASS and one by me
and my colleagues at Siemens.
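
To make the merging problem concrete, here is a deliberately naive sketch
(in Python, with a hypothetical data layout) that merges per-collection
result lists by raw score. The catch is that scores computed against
different collections' statistics are not directly comparable, which is
exactly the problem the work above studies:

    def naive_merge(result_lists, k=10):
        """Naive raw-score merge of per-collection result lists.
        Each list holds (doc_id, score) pairs scored against its own
        collection's statistics, so the scores are not comparable across
        collections; real collection-fusion methods normalize or
        re-estimate scores first. Hypothetical helper, for illustration."""
        merged = [item for results in result_lists for item in results]
        merged.sort(key=lambda pair: pair[1], reverse=True)
        return merged[:k]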

> >Of course, most of this is more than a little academic since
> >the *vast* majority of all searches initiated online are for
> >single keywords rather than more complexly constructed,
> >multi-termed queries.
>
> I'm not sure this ends up being true; even though each search may add just
> one term, people often are building multi-word searches through trial and
> error. There aren't many one-word searches that yield useful results on
> the big Web indexes, in my experience.
>
> I suspect that search is like page layout when PageMaker came out. No one
> thought they'd need to learn typesetting "language," but they did. Today,
> people don't think they'll learn query languages... but I predict that the
> basics of a query language will be familiar to most Internet users within a
> few years. Of course, the question is, what query language... ;-)
>
> Nick

I disagree that people are going to learn query languages to search the
Internet. The statistical systems mentioned above do a very good job
of retrieving relevant documents when given plain English phrases as queries.

Ellen Voorhees
Siemens Corporate Research, Inc.
ellen@scr.siemens.com