Re: word spam

Benjamin Franz (snowhare@netimages.com)
Tue, 16 Apr 1996 06:21:40 -0700 (PDT)


On Mon, 15 Apr 1996, Ken Wadland wrote:

> >> Sadly, some index engines are including grammatically correct pages
> >> that are not really pages. For example, use the Alta Vista engine and
> >> look for "posix".
>
> >posix AND NOT (posix.pl)
>
> This still gets 20,000 hits!

Yup. That is because you didn't give me any more information about what
you were *exactly* looking for. Assuming you are looking for the standards
docs: Searching on "posix AND NOT (posix.pl) AND compliance AND IEEE AND
fips" and telling Alta Vista to list 'fips' first resulted in 167
documents, starting with the standards docs for POSIX at the NIST. It took
me a few iterations to tune the search - but it still took under 5 minutes
(most of that time because of poor network performance at alternet
stalling my page loads for long periods of time).
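
To make the narrowing concrete, here is a minimal sketch of how each added
AND / AND NOT term shrinks the result set. It is not how Alta Vista works
internally; the corpus, the document names, and the search() helper are all
invented for illustration.

```python
# Toy corpus: document id -> set of lowercase terms (all invented).
DOCS = {
    "nist-posix": {"posix", "compliance", "ieee", "fips", "standard"},
    "perl-lib": {"posix", "posix.pl", "perl", "library"},
    "vendor-page": {"posix", "compliance", "unix"},
    "ieee-news": {"ieee", "conference", "posix"},
}


def search(docs, required, excluded=()):
    """Return ids of documents containing every required term and no excluded term."""
    required = {t.lower() for t in required}
    excluded = {t.lower() for t in excluded}
    return sorted(
        doc_id
        for doc_id, terms in docs.items()
        if required <= terms and not (excluded & terms)
    )


# posix
broad = search(DOCS, ["posix"])
# posix AND NOT (posix.pl)
no_perl = search(DOCS, ["posix"], excluded=["posix.pl"])
# posix AND NOT (posix.pl) AND compliance AND IEEE AND fips
narrow = search(DOCS, ["posix", "compliance", "IEEE", "fips"], excluded=["posix.pl"])

print(len(broad), len(no_perl), narrow)
```

Excluding one term barely dents the result set; it is the extra required
terms that do the real work of narrowing.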

> Yes, you can correct this particular case with a revised query; but,
> wouldn't it be nice if the search engines were a little smarter about the
> context of the word in the document?
>
> As another example, try searching for "HTML". AltaVista gets 900,000
> matches. I have yet to find a query for documents about HTML which works on
> any of the search engines. For example, excluding "(HTML)" excludes all
> documents!
>
> I had one heck of a time finding the RFC for HTTP because of this.

Using a search engine to look for 'http' (or 'HTML' or 'gif' or 'jpg' or
any other string that appears as a structural element of HTML markup) on
the WWW is like searching for an acronym that matches a common word
such as 'AND'. It is simply going to appear too many times in too many
places to be useful as a search criterion, and the search engines will
rightly tell you "I don't think so." This is true in large part because of
the huge number of pages with broken HTML resulting in out-of-context
'structural' strings. Alta Vista reports a mere 83 million hits on
'http'. So you search for some other feature of the information that is
not so common: a search for 'RFC' with the keyword 'hypertext' did the
trick. The sixth item returned was a pointer to the March 2nd draft of
HTTP 1.0.
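
One rough way to see why 'http' is useless as a search term and 'RFC' is not
is to compare how many pages each term appears on. The sketch below uses
inverse document frequency, log(N / df), as the measure. That choice is
mine, not something any of the search engines is known to use, and the
mini-corpus and the selectivity() name are invented for illustration.

```python
import math

# Invented mini-corpus; each page is reduced to its set of lowercase terms.
PAGES = [
    {"http", "html", "gif", "home", "page"},
    {"http", "html", "rfc", "hypertext", "draft"},
    {"http", "html", "jpg", "photos"},
    {"http", "html", "recipes"},
]


def selectivity(term, pages):
    """Inverse document frequency, log(N / df); zero means the term is on every page."""
    df = sum(1 for terms in pages if term in terms)
    if df == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(pages) / df)


for term in ("http", "rfc", "hypertext"):
    print(term, round(selectivity(term, PAGES), 3))
```

A term that scores zero, like 'http' here, tells the engine nothing about
which page you want; the rarer terms are the ones worth typing.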

A better approach for any WWW-related standard is of course to go
directly to http://www.w3.org/.

The search engines are powerful enough to find information rapidly and
accurately, but you do have to *specify* the information you are looking
for. It is no different than using a modern library catalog.

--
Benjamin Franz