Heuristics....

Martin Kiff (MGK@NEWTON.NPL.CO.UK)
Sat, 30 Mar 1996 14:19 BST


.... into the stream of discussion about heuristics (which is a way
of saying that I've lost track of who has said what... apologies)

> Well, if you were to analyze the comments of a web page and compare the
> ratio of unique words to total words, you could cull a large percentage of
> these types of pages. Even better, you could use that metric to decide
> whether your indexer should include comments, which means you can still
> index the page, but without the bogus keywords.
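
(For concreteness, I read that suggestion as roughly the check below;
the function name, the regular expressions and the 0.2 cut-off are all
my own guesses - a sketch rather than anything tried on real pages.)

    import re

    def comments_look_stuffed(html, threshold=0.2):
        # Pull the text out of every HTML comment and compare the number
        # of distinct words with the total word count.  A very low ratio
        # means the same keywords repeated over and over again.
        comments = re.findall(r"<!--(.*?)-->", html, re.DOTALL)
        words = re.findall(r"[a-z]+", " ".join(comments).lower())
        if not words:
            return False
        return len(set(words)) / float(len(words)) < threshold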

Do you need this complexity? I guess, and it is only a guess, that people
assume a 'WAIS'-like behaviour in the weighting, i.e. the number of times
*that* word has appeared over the total number of words in the document.
(If I've got this wrong you can correct me privately :-). A linear
relationship, therefore...

But does it need to be linear? How does a

    log(number of times *that* word appears) / total number of words

behave? Artificially loading the document with keywords then becomes
less and less rewarding, and eventually counter-productive, since you are
also pushing up the total number of words. Time for some back-of-the-envelope
work, I think....
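
A first scribble at that, with the 495 'ordinary' words and the padding
counts entirely invented:

    import math

    def linear_weight(count, total):
        # 'WAIS'-like weighting: occurrences of *that* word / total words
        return count / float(total)

    def log_weight(count, total):
        # damped alternative: log of the occurrence count instead
        return math.log(count) / float(total)

    # 495 ordinary words plus n planted copies of the keyword
    for n in (5, 50, 500, 5000):
        total = 495 + n
        print(n, round(linear_weight(n, total), 4), round(log_weight(n, total), 4))

On those made-up numbers the linear score climbs steadily towards 1 the
harder you pad, while the log version flattens out and, for really heavy
padding, drops back below the honest page's figure.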

> What I proposed was that the page *would* be indexed, but the keywords in
> comments or META tags may be discarded.

I would vote for ignoring comments, but surely you *must* include the
<META HTTP-EQUIV="keywords" ....>
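
For what it's worth, the indexer behaviour I have in mind looks roughly
like the sketch below; the class name, and the choice to accept keywords
given via either NAME= or HTTP-EQUIV=, are my own guesses rather than
anything lifted from a real robot. You would feed it a page with
indexer.feed(html) and then index whatever ends up in indexer.words.

    from html.parser import HTMLParser

    class KeywordIndexer(HTMLParser):
        # Gathers indexable words: ordinary body text plus META keywords,
        # with HTML comments dropped on the floor.
        def __init__(self):
            super().__init__()
            self.words = []

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                name = (attrs.get("name") or attrs.get("http-equiv") or "").lower()
                if name == "keywords":
                    content = attrs.get("content") or ""
                    self.words.extend(w.strip().lower() for w in content.split(","))

        def handle_data(self, data):
            self.words.extend(data.lower().split())

        def handle_comment(self, data):
            pass  # comment text never reaches the index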

Regards,
Martin Kiff
mgk@newton.npl.co.uk / mgk@webfeet.co.uk