Re: word spam

arutgers (arutgers@helix.net)
Sat, 13 Apr 1996 15:46:29 -0700


Hi

chris cobb wrote:
>
> This may have been discussed in past sections, but it comes to mind regarding the use
> of large blocks of random or repetitive keyword text in web pages - either to obstruct a crawler's indexing
> mechanism or to increase the ranking of a page.
>
> There is a feature in many word processors - Word comes to mind - as part of the
> "Grammar Check" section. After executing a grammar check, Word displays statistics about ...
>
> Chris Cobb

Nice idea, if you assume people will always use good English grammar and
complete sentences on their web pages. But in things like a list or an index
there is often no need for good grammar, and point form works far better for
many things. (When was the last time you saw "Click here to go to the ________
page."?) For example, look at the search results from any robot such as
WebCrawler: they are very useful, but there is little English grammar and few
if any complete sentences. Or take any company's web page: it probably has a
series of links at the bottom that reads "Technical Support; Ordering; On-line
Catalog", again not much of a sentence. Also, the 'ideal' web page is not too
long and probably has some information in point form, including the title and
the primary heading on the page.

Then there are the language and identification issues. First, you can have a
page in French that does not use the French character set tag; it would
immediately get a very low score and be discarded, despite its usefulness to
people fluent in French. Second, identification is a problem because the
computer has to recognize what kind of word it is looking at (i.e. noun, verb,
etc.) in order to check the grammar. There are lots of company and product
names that would confuse it and lower the score. (Is 'descent' a proper noun,
as in the game, or the ordinary noun? And should it be indexed?) Grammar
checkers are great for essays, but web pages are not essays.

You could, however, use parts of grammar checkers differently. You could have
a 'literacy level'. The average number of characters per word over a document
is usually about 4 for a high school student, and if it's >5 the writer
probably went to university. Again, though, this is affected by leaving out the
'the's and other parts of a proper sentence, or by a web page containing the
word 'supercalifragilisticexpialidocious' (from Mary Poppins). This is actually
part of how grammar checkers assign grade levels. The other use would be to
pick out good keywords. In the 'descent' example, a modified grammar checker
could decide whether 'descent' refers to the game or the ordinary word by
looking at the rest of the sentence, capitalization, etc. If it's the game,
it's worth indexing.
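
To make the two ideas above concrete, here is a rough Python sketch. It is
only an illustration, not how any real grammar checker is known to work. The
function names (split_words, average_word_length, looks_like_proper_noun) are
made up for this example, and the sentence splitting is a crude assumption.

```python
import re

# Runs of letters and apostrophes; digits and punctuation are ignored.
WORD_RE = re.compile(r"[A-Za-z']+")


def split_words(text):
    """Return the words in text, with surrounding apostrophes removed."""
    stripped = (match.strip("'") for match in WORD_RE.findall(text))
    return [word for word in stripped if word]


def average_word_length(text):
    """Mean number of characters per word; 0.0 if the text has no words."""
    words = split_words(text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def looks_like_proper_noun(word, text):
    """True if word appears capitalized somewhere other than a sentence start.

    Hypothetical heuristic: a word that is capitalized in the middle of a
    sentence is treated as a name (such as a product) and so as a keyword.
    """
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = split_words(sentence)
        # Skip the first token, which is capitalized whatever its role.
        for token in tokens[1:]:
            if token.lower() == word.lower() and token[0].isupper():
                return True
    return False
```

Going by the rule of thumb above, a page whose average_word_length comes out
over 5 would be scored as 'university level', and a point-form page with no
'the's would be pushed up for the wrong reason. The capitalization check only
looks at one clue; a real tool would also have to look at the rest of the
sentence.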

As for developing something to ignore large blocks of random or repetitive
text: good idea.
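
For the repetitive case, one possible starting point (a sketch only, in
Python) is to count how often each word occurs and flag a block where a single
word dominates or where there are very few distinct words. The function names
and all the thresholds below are guesses for illustration, not tested values,
and this does nothing about truly random text.

```python
import re
from collections import Counter


def repetition_stats(text):
    """Return (total_words, distinct_ratio, top_share) for a block of text.

    distinct_ratio is distinct words / total words; top_share is the
    fraction of all words taken up by the single most frequent word.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0, 1.0, 0.0
    counts = Counter(words)
    distinct_ratio = len(counts) / len(words)
    top_share = counts.most_common(1)[0][1] / len(words)
    return len(words), distinct_ratio, top_share


def looks_like_keyword_spam(text, min_words=50, max_top_share=0.3,
                            min_distinct_ratio=0.2):
    """Flag text that is long enough to judge and heavily repetitive.

    The three thresholds are arbitrary assumptions and would need tuning
    against real pages.
    """
    total, distinct_ratio, top_share = repetition_stats(text)
    if total < min_words:
        # Too short to judge; short lists and titles repeat words naturally.
        return False
    return top_share > max_top_share or distinct_ratio < min_distinct_ratio
```

A block such as "cheap flights" repeated a hundred times would be flagged,
while a short list of links would be left alone because it falls under
min_words.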

Andrew