> At 10:56 3/29/96 +0100, you wrote:
> >Then, at the end of his page he had the following keywords repeated
> >
> >California
> >Northwest
> >Animation
> >Promotion Web Site
> >Development Web Site
> >Knitter Web
> >Money
> >
> >about 50 times each...
> >
> >okay, so the action to take in his case is obvious (deindex any siGHt(e)
> >belonging to him), but what is the general way to stop this? It's very
> >difficult as far as I can see. It's deliberate, worthless junk trying to
> >get in at the level of people who are providing worthwhile information
> >about California, web sites, etc.
>
> Well, if you were to analyze the comments of a web page and compare the
> ratio of unique words to total words, you could cull a large percentage of
> these types of pages. Even better, you could use that metric to decide
> whether your indexer should include comments, which means you can still
> index the page, but without the bogus keywords.
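In code, that check amounts to something like the following minimal
sketch (present-day Python, purely illustrative; the function name and
the 0.2 threshold are placeholders, not tested values):

    import re

    def comments_look_stuffed(html, threshold=0.2):
        """Guess whether a page's HTML comments are keyword stuffing,
        i.e. the same few words repeated over and over."""
        comments = re.findall(r"<!--(.*?)-->", html, flags=re.DOTALL)
        words = re.findall(r"[a-z']+", " ".join(comments).lower())
        if not words:
            return False                      # no comments to judge
        ratio = len(set(words)) / len(words)  # unique words / total words
        return ratio < threshold              # low ratio = heavy repetition

An indexer could still index the rest of the page and simply skip the
comments whenever this returns True.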
Ugh. That is a bad heuristic. I use keywords to 'cover the bases' for
search engines. IOW I try to guess possible misspellings, alternative
spellings, related concepts that are likely to be searched on, etc. The
result is perhaps twenty to fifty unique words that act as a 'wide net'
for those looking for information covered by the site. The primary defect
of whole-body-text indexing is that the people issuing search requests
are frequently very poor at actually generating *good* search requests
that will match all of the relevant information. You don't believe me?
Try searching for information related to rabbits. As sample searches, try:
buns
bunny
bunnies
rabbit
rabbits
bunnyrabbits
"bunny rabbit"
"bunny rabbits"
buny
bunnys
devilbunnies ( >;-) )
The individual words usually cough up very different subsets of pages
related to rabbits. A *good* search request would look for all of them; in
the absence of such searches, I would keyword a page with all of them, and I
would be correct to do so. But your unique-word-ratio heuristic would
likely reject the page anyway.
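To make "look for all of them" concrete: a wide-net request just ORs the
variants together. A minimal sketch (present-day Python, illustrative
only; the OR syntax is a generic boolean-query assumption, not any
particular engine's):

    def wide_net_query(variants):
        """Build one boolean query covering every spelling variant,
        quoting multi-word phrases so they match as phrases."""
        terms = ['"%s"' % v if " " in v else v for v in variants]
        return " OR ".join(terms)

    rabbit_terms = ["buns", "bunny", "bunnies", "rabbit", "rabbits",
                    "bunnyrabbits", "bunny rabbit", "bunny rabbits",
                    "buny", "bunnys"]
    print(wide_net_query(rabbit_terms))
    # -> buns OR bunny OR bunnies OR rabbit OR rabbits OR bunnyrabbits
    #    OR "bunny rabbit" OR "bunny rabbits" OR buny OR bunnys

Almost nobody types a request like that by hand, which is exactly why
keywording the page with all of the variants is the practical fallback.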
-- Benjamin Franz