Re: Should I index all ...

Terry O'Neill (toneill@mariner.com)
Tue, 09 Jul 1996 12:26:08 -0600


Chris Crowther wrote:
>
> Hi,
>
> > Unfortunately, it also prevents users from finding "The United States of
> > America" and similar phrases, which tends to confuse people quite a bit.
> > Stop words can be stripped from the query if necessary to avoid the problem
> > you describe. Our engine does so as a matter of routine when you use our
> > free text parser, for example.
>
> British rebuke here: And? :)
>
> Question I've been meaning to ask - could someone tell me where to
> find the Guidelines for Robots? I really need the ones that give the
> context for robots.txt
> Chris,
> chris@jm-crowther.co.uk
> www.dungeon.com/~jmcrowther/chris.html
> ChegHchu djajVam djajKak!

I wasn't going to respond to the "United States of America"
problem, but the bold statement that stopwords prevent users
from finding such an important statement as this can't be
left unchallenged.

Stop words do not necessarily prevent users from finding
a phrase such as "The United States of America". If "the"
and "of" are not indexed, then occurrences of the phrase are
indexed as "United States America". The query language for
the engine also strips stop words out of the query, so that
the query also becomes "United States America". The query matches
the index and many happy hits emerge.

So in this case at least, the issue isn't that the desired phrase
can't be found. Instead it is a problem of polluted results, since
a search for "The United States of America" would also return any
occurrences of the shorter "United States America" that may also
occur in the database.
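
For anyone who wants to see the effect concretely, here is a rough
sketch in Python. It is not how our engine or any particular engine
is implemented; the stop list and the function names
(strip_stop_words, phrase_matches) are made up for illustration, and
the "index" is just a list of words per document.

```python
import re

# Assumed stop list for illustration only; real engines use longer ones.
STOP_WORDS = {"the", "of"}


def strip_stop_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def phrase_matches(query, document):
    """Return True if the stripped query occurs as a run of adjacent
    words in the stripped document."""
    q = strip_stop_words(query)
    d = strip_stop_words(document)
    if not q:
        return False
    n = len(q)
    return any(d[i:i + n] == q for i in range(len(d) - n + 1))


query = "The United States of America"

# The intended phrase is still found, because both sides are stripped
# the same way.
found = phrase_matches(query, "Visit the United States of America today")

# The shorter phrase matches as well: this is the polluted result.
polluted = phrase_matches(query, "The United States America Cup")
```

Both found and polluted come out true, which is the whole point: the
phrase is findable, but the engine can no longer tell it apart from
the variant without the stop words.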

We return now to our regular, robot-oriented programming. :-)

Terry O'Neill
mariner.com