Re: Links (don't bother checking; I've done it for you)

Darrin Chandler (dchandler@abilnet.com)
Fri, 29 Mar 1996 13:51:42 -0700


At 10:53 3/29/96 -0800, you wrote:
...
>> Well, if you were to analyze the comments of a web page and compare the
>> ratio of unique words to total words, you could cull a large percentage of
>> these types of pages. Even better, you could use that metric to decide
>> whether your indexer should include comments, which means you can still
>> index the page, but without the bogus keywords.
>
>Ugh. That is a bad heuristic. I use keywords to 'cover the bases' for
>search engines. IOW I try to guess possible mis-spellings, alternative
>spellings, related concepts that are likely to be searched on etc. The
>result is perhaps twenty to fifty unique words that act as a 'wide net'
>for those looking for information covered by the site. The primary defect
>of whole body text indexing is that the people issuing search requests
>are frequently very poor at actually generating *good* search requests
>that will match all relevant information. You don't believe me?
...
>The individual words usually cough up very different sub-sets of pages
>related to rabbits. A *good* search request would look for all of them - in
>the absence of such searches, I would keyword a page to all of them. And I
>would be correct to do so. But your unique words rejection heuristic would
>likely deny the page.

What I proposed was that the page *would* be indexed, but the keywords in
comments or META tags might be discarded. Indeed, this would be detrimental in
examples such as the one you gave. However, not everyone is as conscientious
when adding keywords to their HTML. In total, I believe the signal-to-noise
ratio would increase using my heuristic, even though some good references
would not be returned.
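
To make the idea concrete, here is a rough sketch of the heuristic in Python.
It is only an illustration under stated assumptions: the function names, the
tokenizing regex, and the 0.5 cutoff are hypothetical placeholders, not tested
values, and it assumes that a low unique:total ratio is what marks a stuffed
keyword list.

```python
import re


def unique_word_ratio(text: str) -> float:
    """Return unique words / total words for a block of text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        # Nothing to judge; treat an empty block as harmless.
        return 1.0
    return len(set(words)) / len(words)


def keep_keywords(keyword_text: str, min_ratio: float = 0.5) -> bool:
    """Assumed rule: keep keywords only if they are not heavily repeated.

    min_ratio is an arbitrary illustrative cutoff, not a measured one.
    """
    return unique_word_ratio(keyword_text) >= min_ratio


def text_to_index(body: str, keyword_text: str, min_ratio: float = 0.5) -> str:
    """Always index the body; add comment/META keywords only if they pass."""
    if keyword_text.strip() and keep_keywords(keyword_text, min_ratio):
        return body + " " + keyword_text
    return body
```

A comment block that repeats "rabbit" fifty times scores 0.02 and is dropped,
while the page body is still indexed. A list of distinct related words scores
1.0 and is kept, so where the cutoff sits decides how many good keyword lists
get thrown out along with the bad ones.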

Some time soon I'll be ready to test out my ideas. At that time I will
certainly post my results, and probably give the URL for people here to see
for themselves. In the meantime, I welcome further discussion...
______________________________________________

Darrin Chandler, Duke of URL
Ability Software & Productions
Email: dchandler@abilnet.com
WWW: http://www.abilnet.com/
______________________________________________