> Well, if you were to analyze the comments of a web page and compare the
> ratio of unique words to total words, you could cull a large percentage of
> these types of pages. Even better, you could use that metric to decide
> whether your indexer should include comments, which means you can still
> index the page, but without the bogus keywords.
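For reference, that unique-to-total test is cheap enough to sketch in a
few lines of Python (the word pattern and the idea of a cut-off
threshold are my own guesses, not part of the original suggestion):

    import re

    def comment_spam_score(comment_text):
        # Ratio of unique words to total words; a low ratio means
        # the same keywords are being repeated over and over.
        words = re.findall(r"[a-z']+", comment_text.lower())
        if not words:
            return 1.0
        return len(set(words)) / len(words)

    # A stuffed comment like "cheap cheap cheap flights flights"
    # scores 0.4, while ordinary prose stays much closer to 1.0.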
Do you need this complexity? I guess, and it is only a guess, that
people assume a WAIS-like behaviour in the weighting, i.e. the number
of times *that* word appears divided by the total number of words in
the document. (If I've got this wrong you can correct me privately :-)
A linear relationship, therefore...
But does it need to be linear? How does a

    log(count of *that* word) / total number of words

behave? Artificially loading the document with keywords then becomes
counter-productive: the numerator grows only logarithmically, while
the total word count in the denominator grows linearly, so past a
point each extra copy of the keyword actually lowers the score.
Time for some back-of-envelope work, I think...
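Here is that envelope as a minimal Python sketch (the counts are
invented purely to show the shape of the two curves):

    import math

    def linear_weight(count, total):
        # WAIS-style linear weighting: occurrences / document length.
        return count / total

    def log_weight(count, total):
        # Suggested alternative: log(occurrences) / document length.
        return math.log(count) / total  # assumes count >= 1

    # A 1000-word page that mentions the term 10 times:
    print(linear_weight(10, 1000))      # 0.0100
    print(log_weight(10, 1000))         # ~0.0023
    # The same page stuffed with 9000 extra copies of the term:
    print(linear_weight(9010, 10000))   # 0.9010 -- stuffing pays off
    print(log_weight(9010, 10000))      # ~0.0009 -- stuffing backfires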
> What I proposed was that the page *would* be indexed, but the keywords in
> comments or META tags may be discarded.
I would vote for ignoring comments, but surely you *must* include the
<META HTTP-EQUIV="keywords" ....>
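A minimal sketch of that split, assuming Python's html.parser (the
class name and the comma-separated CONTENT format are my assumptions):

    from html.parser import HTMLParser

    class IndexableText(HTMLParser):
        # Collects body text and META keywords; comments go to
        # handle_comment rather than handle_data, so any keywords
        # hidden in them never reach the index.
        def __init__(self):
            super().__init__()
            self.words, self.keywords = [], []

        def handle_data(self, data):
            self.words.extend(data.split())

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and (a.get("http-equiv") or "").lower() == "keywords":
                content = a.get("content") or ""
                self.keywords.extend(k.strip() for k in content.split(","))

        def handle_comment(self, data):
            pass  # deliberately discarded, per the policy above

    p = IndexableText()
    p.feed('<meta http-equiv="keywords" content="physics, optics">'
           '<!-- spam spam spam --><p>Real text.</p>')
    # p.keywords == ['physics', 'optics']; p.words == ['Real', 'text.']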
Regards,
Martin Kiff
mgk@newton.npl.co.uk / mgk@webfeet.co.uk