Indexing a set of URLs

Fred Melssen (MELSSEN@AZNVX1.AZN.NL)
Thu, 02 May 1996 16:04:54 +0100 (MET)


We have a manually compiled list of topic-specific URLs, which
we maintain and document by hand. To offer the public Boolean
keyword searching across all of these URLs, we want to set up
a robot. This robot has to:

- index all HTML documents. We don't know yet what kind of
indexing (parsing entire HTML files...) and result
evaluation we will need to use;
- provide a flexible and scalable interface to the indexed
information.
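
As a rough illustration of the Boolean keyword searching mentioned
above, here is a minimal sketch in Python of an inverted index over
already-fetched HTML pages. It is only an assumption about how such
an index could look, not a description of Harvest or any other robot
on the list. The names TextExtractor, tokenize, build_index and
search, and the example.org URLs, are hypothetical; fetching pages
over the network and ranking results are left out.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags (script/style text is not filtered here)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def tokenize(html):
    """Return the set of lowercase words found in the text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))


def build_index(pages):
    """Map each word to the set of URLs whose page contains it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in tokenize(html):
            index[word].add(url)
    return index


def search(index, all_urls, must=(), any_of=(), must_not=()):
    """Boolean query: AND over `must`, OR over `any_of`, NOT over `must_not`."""
    result = set(all_urls)
    for word in must:
        result &= index.get(word.lower(), set())
    if any_of:
        result &= set().union(*(index.get(w.lower(), set()) for w in any_of))
    for word in must_not:
        result -= index.get(word.lower(), set())
    return sorted(result)


# Hypothetical hand-maintained list of pages (URL -> HTML already fetched).
pages = {
    "http://example.org/a.html": "<title>Pacific canoes</title><p>Outrigger canoes of Fiji</p>",
    "http://example.org/b.html": "<p>Fiji history and trade</p>",
    "http://example.org/c.html": "<p>Canoes and trade in Samoa</p>",
}
index = build_index(pages)
hits = search(index, pages, must=["trade"], any_of=["fiji", "samoa"], must_not=["history"])
```

Storing only a word-to-URL mapping keeps disk use small, at the cost
of not supporting phrase queries or weighting of results.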

The indexing part is the difficult one. We run Linux 1.3.93 and
have the following wishes:

- System maintenance should be undemanding:
  - disk use should be minimal and efficient
  - network traffic should be low, and bandwidth use minimal
  - maintenance (configuring, updating...) should be as
    light as possible. Ideally the Webmaster should be
    able to maintain the robot.
- There should be a way for the Web page owner(s) to
  add URLs to the searchable database.
- We want an advanced search query interface where the users have
  maximum control over how search results are listed.

The List of Robots
(http://webcrawler.com/mak/projects/robots/active.html)
lists a few engines that focus specifically on
community- or topic-specific collections of HTML objects:
Harvest, Peregrinator (sources not available) and HI Search.

Harvest's motivation matches ours: it indexes community-specific
collections rather than trying to locate and index every object
that can be found. But I see a possible drawback in choosing Harvest:

Our operating system, Linux 1.3.93, is not supported by Harvest.
Configuring the robot and keeping it running WITHOUT much
maintenance looks like a hard job.

We want to make a good choice, and your suggestions and discussion
would be highly appreciated. Thank you.

-fred melssen-

------------------------------------------------------------------------
Fred Melssen | Manager Electronic Information Services
P.O.Box 9104 | Centre for Pacific Studies | Phone and fax:
6500 HE Nijmegen | University of Nijmegen | 31-024-378.3666 (home)
The Netherlands | Email: melssen@aznvx1.azn.nl | 31-024-361.1945 (fax)
| http://www.kun.nl/~melssen | PGP key available
------------------------------------------------------------------------