Re: Keyword indexing

David Reilly (dodo@fan.net.au)


At 05:29 PM 6/18/96 -0700, you wrote:

>>> My problem is deciding exactly *which* words are important to index, and
>>> how to store such a huge amount of data in a manner that will be easily
>>> accessible for a search engine.
>
>Maintaining a list of important keywords by hand is probably not a good
>idea. A better idea would be to use machine learning techniques to
>automatically classify pages as relevant to computer science (or whatever
>topic). Look at Gerard Salton's work for starters.

I've never heard of machine learning techniques for classification by subject.
Is the work of the person you mentioned, Gerard Salton, available on the Web?
Or is he the author of a book? If you happen to know the name of the book,
or some way of finding it, I'd be very appreciative.
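
From the little I can gather, the idea seems to be to give the topic a
weighted vocabulary and score each page by how much of it the page hits.
A very rough Python sketch of my understanding (the topic words and
weights below are invented, and this is surely a crude version of the
real thing):

    def score(page_words, topic_weights):
        """Crude vector-space relevance: sum the topic weights of the
        terms appearing in the page, normalised by page length."""
        hits = sum(topic_weights.get(w, 0.0) for w in page_words)
        return hits / max(len(page_words), 1)

    # Invented computer-science topic vocabulary with term weights:
    cs_topic = {"algorithm": 3.0, "compiler": 2.5, "network": 1.5}

    words = "the compiler emits code for the network stack".split()
    print(score(words, cs_topic))  # higher scores suggest an on-topic page

Pages scoring above some threshold would then be classified as relevant.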

Thanks,
David

=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=
David Reilly,
Computer Programmer, dodo@fan.net.au
http://www.fan.net.au/~dodo s1523@sand.it.bond.edu.au
=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=

>etc, not to mention human editorship :-)

I agree. Certainly the facility for users to add new URLs might turn up a few.
Interfacing with news servers and mailing lists, however, might make the scope
of the project too large. As for parsing specific pages, what about the
search engines of your competitors ;-)

>>My problem is deciding exactly *which* words are important to index, and
>>how to store such a huge amount of data in a manner that will be easily
>>accessible for a search engine.
>
>Well, from an engineering perspective it is useful to split the
>discovery, harvesting, and indexing. In other words, once you
>know which pages you want, pull them across, store them in full
>(maybe preparsed and/or compressed), and let your information-retrieval
>engine decide how to index it. That way you can switch engines, or
>policies, without needing to re-crawl, and because you retain the
>full text your engine can make the most of it.

I hadn't thought of that, actually. Storing the plaintext would make the
harvesting/discovery modules simpler, and the search engine's indexer
could then be developed on another platform. Thanks for the suggestion!
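
Something along these lines is what I'll try for the harvesting side, then
(a rough Python sketch; the storage layout is only a guess at what would
work, not a finished design):

    import gzip, hashlib, urllib.request
    from pathlib import Path

    STORE = Path("pages")  # hypothetical storage directory

    def harvest(url):
        """Fetch a page and store its full text, compressed,
        keyed by a hash of the URL."""
        with urllib.request.urlopen(url) as resp:
            raw = resp.read()
        STORE.mkdir(exist_ok=True)
        name = hashlib.sha1(url.encode()).hexdigest()
        (STORE / (name + ".gz")).write_bytes(gzip.compress(raw))
        # The indexer runs later, over the stored files, so crawling
        # policy and indexing policy can change independently and
        # nothing needs re-crawling when the engine changes.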

>>Has anyone got any suggestions as to how to go about this? Should I maintain
>>a list of keywords which my spider will index, or should I index every single
>>word (including small ones such as if, the, and, but, etc...)?
>
>That depends on the features you want to offer, your retrieval engine,
>and your disk space situation, so only you can determine that :-)

I guess so. I was really interested, though, in which would be more efficient:
screening out words I don't want, or only selecting words I do want. I'm
guessing the latter would be better if I want to make the index subject-specific.

>Having the small words is useful for phrase searching, as in
>"The young and the restless", but not all engines support that.

I don't think I'd want to include 'the', 'it', 'and', and so on, since I'd like
to keep the plaintext as small as possible. I'd probably filter them out, then
let the indexer handle selecting the words it wants.
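
For the filtering step I'm picturing something as simple as this (the stop
list is just a sample; a real one would be much longer):

    import re

    # Sample stop list; a production list would be much longer.
    STOP_WORDS = {"the", "a", "an", "and", "or", "but", "if",
                  "it", "of", "to", "in", "is", "etc"}

    def filter_plaintext(text):
        """Drop stop words before storage; the indexer then selects
        from whatever terms remain."""
        words = re.findall(r"[a-z0-9']+", text.lower())
        return " ".join(w for w in words if w not in STOP_WORDS)

    print(filter_plaintext("The young AND the restless"))  # -> "young restless"

Though as you point out, once the small words are gone from the stored text,
phrase searches like that one can't be supported later.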

Thanks for your suggestions,
David

=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=
David Reilly,
Computer Programmer, dodo@fan.net.au
http://www.fan.net.au/~dodo s1523@sand.it.bond.edu.au
=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=