- index all HTML documents. We don't know yet what kind of
  indexing (parsing entire HTML documents...) and ranking of
  results we will need to use (a rough sketch of what we have in
  mind follows this list);
- provide a flexible and scalable interface to the indexed
  information;
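
Purely to illustrate that first point, here is a minimal sketch of
the kind of indexer we mean, written in modern Python only as an
illustration; every name in it (INDEX, index_document) is made up
for this example and does not come from any robot package:

    import re
    import urllib.request
    from collections import Counter, defaultdict

    # Inverted index: word -> {url: number of occurrences}.
    INDEX = defaultdict(Counter)

    def index_document(url):
        """Fetch one HTML document, strip the markup, and record
        every word. A real robot would honour robots.txt and
        throttle its requests to keep network traffic low."""
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            INDEX[word][url] += 1
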
The indexing part is the difficult one. We run Linux 1.3.93 and
have the following wishes:
- System maintenance should be low-demanding:
  - disk use should be minimal and efficient;
  - network traffic should be low, and bandwidth use minimal;
  - maintenance (configuring, updating...) should take as little
    effort as possible; ideally the webmaster alone should be able
    to maintain the robot.
- There should be a way for the Web page owner(s) to add URLs to
  the searchable database (the index_document routine sketched
  above could serve as the back end for such a submission form).
- We want an advanced search query interface that gives users
  maximum control over how search results are ranked and
  enumerated (see the sketch after this list).
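
Again purely as an illustration of that last wish, here is a sketch
of the query side. It assumes the INDEX structure from the sketch
above; the limit and score parameters stand for the user-controlled
enumeration we have in mind and are not taken from any existing
engine:

    from collections import Counter, defaultdict

    # The inverted index built by the indexer sketch above:
    # word -> {url: number of occurrences}.
    INDEX = defaultdict(Counter)

    def search(query, limit=10, score=sum):
        """Return up to `limit` URLs containing every query word,
        ranked by a caller-supplied scoring function over the
        per-word occurrence counts (default: total frequency)."""
        words = query.lower().split()
        if not words:
            return []
        # Documents containing the first word, intersected with
        # those containing each remaining word.
        docs = set(INDEX[words[0]])
        for w in words[1:]:
            docs &= set(INDEX[w])
        return sorted(
            docs,
            key=lambda url: score(INDEX[w][url] for w in words),
            reverse=True,
        )[:limit]

    # A user who prefers to weight by the rarest query term could
    # pass score=min instead:
    #     search("pacific studies", limit=20, score=min)
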
The List of Robots
(http://webcrawler.com/mak/projects/robots/active.html)
lists a few engines that specifically focus on
community- or topic-specific collections of HTML objects:
Harvest, Peregrinator (sources not available) and HI Search.
Harvest's motivation reflects ours: it indexes community-specific
collections rather than locating and indexing every object that
can be found. But I see possible drawbacks in choosing Harvest:
our operating system, Linux 1.3.93, is not supported, and
configuring the robot and keeping it in the air WITHOUT much
maintenance looks like a hard job.
We want to make a good choice, and your suggestions and discussion
are highly appreciated. Thank you.
-fred melssen-
------------------------------------------------------------------------
Fred Melssen | Manager Electronic Information Services
P.O.Box 9104 | Centre for Pacific Studies | Phone and fax:
6500 HE Nijmegen | University of Nijmegen | 31-024-378.3666 (home)
The Netherlands | Email: melssen@aznvx1.azn.nl | 31-024-361.1945 (fax)
| http://www.kun.nl/~melssen | PGP key available
------------------------------------------------------------------------