word spam

chris cobb (c-cobb@ix.netcom.com)
Fri, 12 Apr 1996 00:22:55 -0400


This may have been discussed in past threads, but it comes to mind regarding the use
of large blocks of random or repetitive keyword text in web pages, either to obstruct a
crawler's indexing mechanism or to inflate the ranking of a page.

There is a feature in many word processors - Word comes to mind - as part of the
"Grammar Check" function. After running a grammar check, Word displays statistics about
the document, including its scores on a couple of readability scales (the Flesch-Kincaid
Grade Level is one such scale). These scales were designed to rank the 'literary level'
of a document. Typically the score represents a grade level, like 9.4, indicating that
the document is constructed and written in a manner indicative of a ninth-grade reading
level.
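
For concreteness, here is a minimal sketch (in Python) of how such a grade-level score
could be computed. The Flesch-Kincaid grade formula itself is standard; the syllable
counter is a rough heuristic of my own, not whatever Word actually uses internally:

    import re

    def count_syllables(word):
        # Rough heuristic: count groups of consecutive vowels, with a
        # minimum of one syllable per word and a silent-e adjustment.
        groups = re.findall(r"[aeiouy]+", word.lower())
        count = max(1, len(groups))
        if word.lower().endswith("e") and count > 1:
            count -= 1
        return count

    def flesch_kincaid_grade(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        if not sentences or not words:
            return 0.0
        syllables = sum(count_syllables(w) for w in words)
        # Standard Flesch-Kincaid grade-level formula.
        return (0.39 * (len(words) / len(sentences))
                + 11.8 * (syllables / len(words)) - 15.59)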

I suggest that it would be possible to run these algorithms on web pages to determine
whether a page contains sufficient sentence structure to be worth indexing. Obviously, a
page like the 'red herring' page would be thrown out unless the author spent the time to
construct it so that it reads as normal sentences.
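
As a hedged sketch of how a crawler might apply the score, continuing the Python above:
the cutoff band here is a made-up placeholder, and real values would have to be tuned
against samples of known spam pages and known legitimate pages:

    # Hypothetical cutoffs, chosen only for illustration.
    MIN_GRADE = 2.0
    MAX_GRADE = 16.0

    def worth_indexing(page_text):
        grade = flesch_kincaid_grade(page_text)
        # Keyword-stuffed pages tend to land outside the band where
        # ordinary prose scores: a page with no sentence punctuation
        # becomes one enormous "sentence" with an absurdly high grade,
        # while one word repeated with periods scores near zero.
        return MIN_GRADE <= grade <= MAX_GRADE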

Some analysis using these algorithms could probably help determine the level below which
a document is 'useless' for information purposes. 'Useless' documents should still be
indexed, but the default webcrawler query might exclude them unless instructed otherwise.
That option would be necessary when a person is searching for a foreign-language or
math/science type of page, which would score poorly on an English readability scale even
though it is perfectly legitimate.
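
Continuing the sketch, the default-exclusion idea might look like the code below. The
index layout and the include_low_structure flag are assumptions made purely for
illustration, not a description of any real crawler:

    def search(query, index, include_low_structure=False):
        # 'index' is assumed to be a list of dicts holding the page
        # text and a grade-level score computed at crawl time.
        hits = [d for d in index if query.lower() in d["text"].lower()]
        if not include_low_structure:
            # The default query hides pages outside the readability
            # band; the flag restores them for foreign-language or
            # math/science searches.
            hits = [d for d in hits
                    if MIN_GRADE <= d["grade"] <= MAX_GRADE]
        return hits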

Just some thoughts.

Chris Cobb