Many word processors - Word comes to mind - include a feature as part of their
"Grammar Check" tools. After running a grammar check, Word displays statistics about
the document, including its score on a couple of proprietary readability scales.
I don't recall the exact names of these scales offhand, but they were created by the dictionary/thesaurus
/encyclopedia people to rank the 'literary level' of a document. Typically the score represents
a grade level, like 9.4, indicating that the document is constructed and written in a manner
indicative of a ninth-grade reading level.
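For what it's worth, I believe the grade-level figure Word reports is the Flesch-Kincaid
grade level, and it is simple to approximate. A minimal sketch in Python (the syllable
counter here is a crude stand-in for whatever the real tools use):

    import re

    def count_syllables(word):
        # Rough heuristic: count runs of vowels; real readability tools
        # use dictionaries or much better heuristics.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        # Crude sentence and word splitting.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        if not sentences or not words:
            return 0.0
        syllables = sum(count_syllables(w) for w in words)
        # Standard Flesch-Kincaid grade-level formula.
        return (0.39 * (len(words) / len(sentences))
                + 11.8 * (syllables / len(words)) - 15.59)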
I suggest that it would be possible to run these algorithms on web pages to determine
whether a page contains sufficient structure to be worth indexing. Obviously, a page
like the 'red herring' page would be thrown out unless the author spent time constructing
the page so it would read as normal sentences.
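As a rough filter, a crawler could score each page with the function sketched above and
flag pages whose score falls outside a plausible grade-level band. The band below is a
made-up starting point, not a tested value:

    def worth_indexing(page_text, low=1.0, high=20.0):
        # Keyword-stuffed pages with no real sentence structure tend to land
        # far outside a plausible band: missing punctuation inflates the
        # words-per-sentence term, while bare lists of short terms score
        # near (or below) zero.
        grade = flesch_kincaid_grade(page_text)
        return low <= grade <= high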
Some analysis using these algorithms could probably help determine at what level a document
becomes 'useless' for information purposes. 'Useless' documents should still be indexed, but
the default webcrawler query might exclude them unless instructed otherwise. Being able to turn
that exclusion off would be necessary if a person were searching for a foreign-language or
math/science type of page.
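At query time that could look something like the sketch below, where each indexed document
carries a precomputed grade and the filter is on by default (the index object, its lookup
method, and the 'grade' field are all hypothetical):

    def search(index, query, include_low_structure=False):
        # 'index.lookup' and the per-document 'grade' field are made up here;
        # the point is that the structure filter is applied by default and can
        # be switched off for foreign-language or math/science searches.
        hits = index.lookup(query)
        if include_low_structure:
            return hits
        return [doc for doc in hits if 1.0 <= doc.grade <= 20.0]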
Just some thoughts.
Chris Cobb