Re: Keyword indexing

Martijn Koster (m.koster@webcrawler.com)
Tue, 18 Jun 1996 06:35:08 -0700


At 10:00 AM 1/1/70, David Reilly wrote:
>I'm currently developing a new spider (IntelliAgent) whose purpose is to
>find new internet resources within a specific subject domain (for example,
>computer programming), and then create an index as a reference for a future
>search engine.

I always wonder about subject-specific robots. Are you going to crawl
the entire Web and drop everything that's not specific to your subject
on the floor? That easily gets wasteful and unproductive: it wastes
network bandwidth, takes forever, and you'd still miss many relevant
pages.

It might make more sense to seed the robot with a list of known (or
suspected) relevant pages, for example the results of a search for
"computer science" on the large indices. You can then decide which of
the pages they link to are relevant, and stop recursing when you see
the relevance end.
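
A rough sketch of that seeded approach, in Python. Everything here is
made up for illustration: the keyword-counting relevance test, the
function names, and the toy in-memory "web" that stands in for real
fetching. A real robot would also need politeness delays and
/robots.txt handling.

```python
from collections import deque

def is_relevant(text, keywords, threshold=2):
    """Hypothetical relevance test: count keyword occurrences in the text."""
    words = text.lower().split()
    hits = sum(words.count(k) for k in keywords)
    return hits >= threshold

def focused_crawl(seeds, fetch, keywords, max_pages=100):
    """Breadth-first crawl from the seeds; only relevant pages are expanded.

    fetch(url) is assumed to return (text, links), or None on failure.
    """
    queue = deque(seeds)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        page = fetch(url)
        fetched += 1
        if page is None:
            continue
        text, links = page
        if not is_relevant(text, keywords):
            continue  # relevance ends here, so do not follow its links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant

# Toy in-memory "web": url -> (text, outgoing links)
WEB = {
    "a": ("python programming and compiler programming notes", ["b", "c"]),
    "b": ("recipes for soup and bread", ["d"]),
    "c": ("compiler design and programming languages", ["a", "d"]),
    "d": ("programming a compiler in python", []),
}

result = focused_crawl(["a"], WEB.get, ["programming", "compiler", "python"])
print(result)
```

Page "b" is fetched but judged irrelevant, so its links are never
followed; "d" is still found because the relevant page "c" links to it.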

Still, I can't escape the feeling that there must be a better way
to find _new_ resources for a subject than blind crawling on the
Web. Like monitoring newsgroups and mailing lists, parsing specific
pages, etc., not to mention human editorship :-)

>My problem is deciding exactly *which* words are important to index, and
>how to store such a huge amount of data in a manner that will be easily
>accessible for a search engine.

Well, from an engineering perspective it is useful to split
discovery, harvesting, and indexing into separate steps. In other
words, once you know which pages you want, pull them across, store
them in full (maybe preparsed and/or compressed), and let your
information-retrieval engine decide how to index them. That way you
can switch engines, or policies, without needing to re-crawl, and
because you retain the full text your engine can make the most of it.
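
To make that split concrete, here is a minimal sketch in Python,
assuming SQLite as the page store and zlib for compression. Those
choices, the table layout, and all function names are mine for
illustration only; any store that keeps the full text would do.

```python
import sqlite3
import zlib

def open_store(path=":memory:"):
    """Open (or create) the page store; one row per harvested page."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)"
    )
    return db

def save_page(db, url, text):
    """Harvesting step: keep the full text, compressed."""
    body = zlib.compress(text.encode("utf-8"))
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body))
    db.commit()

def stored_pages(db):
    """Yield (url, text) for every stored page, without touching the network."""
    for url, body in db.execute("SELECT url, body FROM pages ORDER BY url"):
        yield url, zlib.decompress(body).decode("utf-8")

def build_index(db, keep_word):
    """Indexing step: the policy (keep_word) can change without a re-crawl."""
    index = {}
    for url, text in stored_pages(db):
        for word in text.lower().split():
            if keep_word(word):
                index.setdefault(word, set()).add(url)
    return index

db = open_store()
save_page(db, "http://example.com/a", "the spider and the index")
save_page(db, "http://example.com/b", "the index of the web")

STOP_WORDS = {"the", "and", "of"}  # hypothetical stop list
full_index = build_index(db, lambda w: True)
small_index = build_index(db, lambda w: w not in STOP_WORDS)
print(sorted(small_index))
```

Both indexes are built from the same stored pages, so changing the
indexing policy costs a pass over local data rather than another crawl.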

>Has anyone got any suggestions as to how to go about this? Should I maintain
>a list of keywords which my spider will index, or should I index every single
>word (including small ones such as if, the, and, but, etc...)?

That depends on the features you want to offer, your retrieval engine,
and your disk space situation, so only you can determine that :-)

Having the small words is useful for phrase searching, as in
"The young and the restless", but not all engines support that.

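To illustrate the phrase-searching point: a small Python sketch of a
positional index that keeps every word, small ones included. The
function names and sample documents are invented for the example, and
splitting on whitespace is a simplification (no punctuation handling).

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> doc id -> list of positions where the word occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of docs containing the words of phrase consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Only documents containing every word can contain the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    matches = []
    for doc_id in sorted(candidates):
        starts = index[words[0]][doc_id]
        # The phrase matches if word i sits at position start + i.
        if any(
            all(start + i in index[w][doc_id] for i, w in enumerate(words))
            for start in starts
        ):
            matches.append(doc_id)
    return matches

docs = {
    1: "The young and the restless",
    2: "the restless young",
    3: "young and restless",
}
index = build_positional_index(docs)
print(phrase_search(index, "the young and the restless"))
```

Only document 1 matches the full phrase. Had "the" and "and" been
dropped at indexing time, documents 1 and 3 would both reduce to
"young restless" and the engine could no longer tell them apart.
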
-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html