Re: Keyword indexing

Ricardo Eito Brun (x8035952@fedro.ugr.es)
Thu, 20 Jun 1996 15:51:00 +0100


Gerard Salton and Michael McGill
wrote a book (among several other
interesting works) in 1983 with the title
Introduction to Modern Information Retrieval.
It explains how to index full-text documents.
The main techniques are based on the frequency of
the words that appear in the text.
If a word appears many times in all the
documents of your database, it will be a bad keyword,
because it does not discriminate well between documents.
I think Salton's book is not available through the WWW,
but there is another interesting book, written by
Van Rijsbergen, which can be accessed through the WWW
at this URL: http://www.dcs.gla.ac.uk/publications/
This book covers similar content to Salton's,
and although it was published earlier (the second
edition in about 1979), the 'state of the art' in
'automatic indexing' hasn't evolved into more complex
systems.
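
To make the frequency idea above a bit more concrete, here is a
rough sketch in Python. It is only my own illustration, not code
from either book: the weighting formula (term frequency times the
log of the inverse document frequency), the function names and the
three tiny sample documents are all assumptions chosen for the
example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_weights(docs):
    """Return one {word: weight} dict per document.

    Assumed weighting (an illustration, not taken from the books):
    weight = tf * log(N / df), where tf is the count of the word in
    the document, N the number of documents, and df the number of
    documents that contain the word.
    """
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))

    weights = []
    for tokens in token_lists:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


def top_keywords(docs, k=3):
    """Pick the k highest-weighted words per document (ties: alphabetical)."""
    result = []
    for w in keyword_weights(docs):
        ranked = sorted(w, key=lambda word: (-w[word], word))
        result.append(ranked[:k])
    return result


# Hypothetical sample data, just to exercise the functions.
docs = [
    "the index stores the words of the document",
    "the search engine reads the index",
    "the crawler fetches the pages",
]
weights = keyword_weights(docs)
```

With these sample documents, "the" occurs in every document, so its
weight comes out as zero and it is never chosen as a keyword, while
a word that occurs in only one document gets the highest weight.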

Good luck.

At 10:00 1/01/70 +1000, you wrote:
>At 05:29 PM 6/18/96 -0700, you wrote:
>
>>>> My problem is deciding exactly *which* words are important to index, and
>>>> how to store such a huge amount of data in a manner that will be easily
>>>> accessible for a search engine.
>>
>>Maintaining a list of important keywords by hand is probably not a good
>>idea. A better idea would be to use machine learning techniques to
>>automatically classify pages as relevant to computer science (or whatever
>>topic). Look at Gerard Salton's work for starters.
>
>I've never heard of machine learning techniques for classification by subject.
>Is the work of the person you mentioned, Gerard Salton, available on the Web?
>Or is he the author of a book? If you happen to know the name of the book,
>or some way of finding it, I'd be very appreciative.
>
>Thanks,
>David
>
>=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=
> David Reilly,
> Computer Programmer, dodo@fan.net.au
> http://www.fan.net.au/~dodo s1523@sand.it.bond.edu.au
>=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=