Re: Should I index all ...

Trevor Jenkins (tfj@apusapus.demon.co.uk)
Wed, 10 Jul 1996 08:45:24 +0000


> > > Unfortunately, it also prevents users from finding "The United States of
> > > America" and similar phrases, which tends to confuse people quite a bit.
> > > Stop words can be stripped from the query if necessary to avoid the problem
> > > you describe. Our engine does so as a matter of routine when you use our
> > > free text parser, for example.
> >
> I wasn't going to respond to the "United States of America"
> problem, but the bold statement that stopwords prevent users
> from finding such an important statement as this can't be
> left unchallenged.

Neither was I until now.

> Stop words do not necessarily prevent users from finding a phrase
> such as "The United States of America". If "the" and "of" are not
> indexed, then occurrences of the phrase are indexed as "United
> States America". The query language for the engine also strips
> stop words out of the query, so that the query also becomes
> "United States America." Query matches index and many happy hits
> emerge.

If we are trading in anecdotal search terms, try the phrase "Lloyds of
London". If the conventional rules are followed, then "of" is removed
via a stop list. This creates an ambiguity by producing the new phrase
"Lloyds London". However, there are two separate institutions that
match: Lloyds of London (the desired company), which is an insurance
company, and Lloyds London (the unwanted company), which is a major
bank. Just to complicate matters, Lloyds (the bank) will broker
insurance at Lloyds of London. Okay, so I find "Lloyds of London", but
it is in amongst noise created not by the presence of "Lloyds London"
but by the removal of a very significant word.
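
A minimal sketch of the collision, assuming a hypothetical stop list
and a whitespace tokeniser (neither is taken from any particular
engine):

```python
# Hypothetical stop list and normaliser, for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "and", "or"}

def strip_stop_words(phrase):
    """Lower-case a phrase and drop every word found in STOP_WORDS."""
    return [word for word in phrase.lower().split() if word not in STOP_WORDS]

insurer = strip_stop_words("Lloyds of London")
bank = strip_stop_words("Lloyds London")

# After stripping, the two institutions share one indexed phrase.
print(insurer)
print(bank)
print(insurer == bank)
```

Once "of" is gone, nothing left in the index can tell the two names
apart.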

> We return now to our regular, robot-oriented programming. :-)

But this opens up the wider issue of counterintuitive search
syntax. Stopwords are a nuisance. They were originally proposed back
in the days when 5Mb disks were considered outrageously large and very
expensive. The programmers of search systems wanted to reduce the
disk space requirement for their inverted lists, so they removed the
common words. However, this confuses users who are not information
scientists. Also, the use of stoplists is Anglo-American centric.
Consider the common English words "the" and "or", which are also valid
foreign words: in French they mean tea and gold. How is your
indexing robot to detect whether the strings "the" and "or" are English
or French? Maybe I work in a commodities market and have to write
documents (e.g. web pages) in parallel languages. Alternatively, I
might work in a Swedish company and have to write technical documents
in Swedish but use English words when there is no equivalent Swedish
term. Will your indexer still remove these supposedly "common" terms?
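
To make the language problem concrete, here is a small sketch. The
stop list, the tokeniser and the sample sentence are all assumptions
made up for the illustration:

```python
import re

# Hypothetical English-only stop list.
ENGLISH_STOP_WORDS = {"the", "or", "of", "and"}

def index_terms(text):
    """Split text into lower-case ASCII words and drop English stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]

# French for "the price of gold is rising"; "or" is the word for gold.
kept = index_terms("Le prix de l'or augmente")
print(kept)
print("or" in kept)
```

The one term a commodities trader would search for is the one the
indexer throws away.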

The other side of the coin is the abuse of the rules of punctuation.
When I use Alta Vista I am forced to think about the space character.
In Latin-based languages the space is the adjacency operator. But not
in AV. In AV the adjacency operator is ".". Try explaining this to a
user who is not a computing scientist and observe the look of
bewilderment on their face.

If you have reached this far into my diatribe then you will realise
that my answer to the original question is "Yes, you should index all
terms in the document". I know it can be done because I worked on a
product that did just that. The cost of including all the terms was
very low. The disk space requirements were minimal and the programming
was easier. This latter point was a major selling point: there was
less chance of a programming error.
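
For illustration only, and not the product mentioned above, here is a
minimal sketch of an index that keeps every term with its position and
answers phrase queries by adjacency. The function names and sample
documents are hypothetical:

```python
from collections import defaultdict

def build_index(docs):
    """Map every term (no stop list) to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents holding the phrase's terms in consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term i of the phrase must sit exactly i words after the first.
            if all(start + i in index[term].get(doc_id, ())
                   for i, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {
    "insurer": "claims are underwritten at Lloyds of London",
    "bank": "Lloyds London branch opening hours",
}
index = build_index(docs)
print(phrase_search(index, "Lloyds of London"))
print(phrase_search(index, "Lloyds London"))
```

With "of" kept in the index, each query finds only its own
institution, and there is no stop-word handling to get wrong in either
the indexer or the query parser.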

The only valid case I have ever seen for stoplists is in applications
storing secret and confidential data. There the stoplists are at the
*user* level, not at the document level. Though in this case it might
be easier to employ go-lists: specify what the user can see rather
than what they should not see.
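
One reading of the difference, sketched with hypothetical per-user
lists and made-up terms:

```python
def filter_with_stop_list(terms, stop_list):
    """Show the user everything except the terms listed as forbidden."""
    return [term for term in terms if term not in stop_list]

def filter_with_go_list(terms, go_list):
    """Show the user only the terms explicitly listed as permitted."""
    return [term for term in terms if term in go_list]

doc_terms = ["budget", "merger", "takeover", "schedule"]

# "takeover" is sensitive, but nobody has added it to the stop list yet.
via_stop_list = filter_with_stop_list(doc_terms, {"merger"})
via_go_list = filter_with_go_list(doc_terms, {"budget", "schedule"})

print(via_stop_list)
print(via_go_list)
```

A stop list leaks any sensitive term nobody thought to list, whereas a
go-list hides it by default.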

Regards, Trevor.

--

Procrastinate Now!