Netscape Catalog Server: An Eval

Eric Kristoff (eric_kristoff@il.us.swissbank.com)
Sun, 22 Sep 96 17:42:03 -0500


Hi All,

I've been testing the Netscape Catalog Server and wanted to share my experiences so far. A couple of caveats first:
1. These are my opinions, not Swissbank's.
2. This is my first pass at it, and the product is highly configurable.

The text below is from an email I sent to Netscape with suggestions, complaints, etc.

Critical Issues with the Netscape Catalog Server:
=================================================
* IT MUST HAVE MULTI-CATEGORY URL filtering capability. The fact that a resource can be assigned to only one category is ludicrous. What good is categorizing, then? It renders the browsing capability completely useless. If my users knew exactly how my sites are organized, I wouldn't need a search engine, would I? That ABSOLUTELY must be fixed.
* Also, on that subject, container searching from general to specific should be supported. That is, if category x.1 is nested inside category x, I should be able to find something in x.1 by searching x. I shouldn't have to drill down. Again, Netscape's implementation has fallen short of the mark. (A sketch of the behavior I mean follows this item.)
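
To make both of those category points concrete, here's a toy sketch in Python; it's purely illustrative and nothing like Netscape's actual API. Categories are dotted paths, a resource can carry several of them, and searching a general container also matches anything filed under its sub-categories:

    # Illustrative only -- models categories as dotted paths ("x", "x.1")
    # and lets a resource live in several categories at once.
    def in_container(category, container):
        """True if category equals container or is nested inside it."""
        return category == container or category.startswith(container + ".")

    # A resource may carry multiple categories (the multi-category request).
    resources = {
        "http://intranet/hr/benefits.html": ["hr", "finance.payroll"],
        "http://intranet/eng/specs.html":   ["eng.docs"],
    }

    def search_category(container):
        """Searching a general container also returns resources filed
        under any of its more specific sub-categories."""
        return [url for url, cats in resources.items()
                if any(in_container(c, container) for c in cats)]

    print(search_category("finance"))  # finds the payroll doc, no drilling down

With that, searching "finance" finds a document filed only under "finance.payroll", and the benefits page turns up under both "hr" and "finance".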

* Although subjective, it seems that the keyword parser in "Search" is of much lower quality than those in Excite or AltaVista. I do a search with 'x', take some words from that result set, re-do the search, and it doesn't find any of the docs from the first search.
* For an intranet, it must have the ability to sync up (manually and automatically) the taxonomies and schemas from distributed RDS and Catalog servers. In an intranet it is safe to assume that the administrator has a good knowledge of the company's information domain, and can hence explicitly force the syncing of those databases. That saves an immense amount of time and headaches. Otherwise, the browsing capability can be rendered useless by even a slight memetic drift between servers. (A sketch of the forced sync I mean follows.)
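
A toy sketch of that forced sync in Python (nothing to do with Netscape's actual storage or wire format; the category names are made up). The admin simply declares one server's taxonomy authoritative:

    # Illustrative only: an admin forcing a master taxonomy onto a remote
    # RDS, so browsing categories look identical on every server.
    master = {"finance": ["payroll", "tax"], "eng": ["docs"]}
    remote = {"finance": ["payroll"], "marketing": ["press"]}

    def force_sync(master, remote):
        """Master wins outright -- no drift left afterwards."""
        remote.clear()
        remote.update({cat: list(subs) for cat, subs in master.items()})

    force_sync(master, remote)
    print(remote)  # now matches master exactly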
* Robots on the RDS need a time-to-live parameter in order to prevent the death of a server or the filling-up of a hard disk: a final fail-safe to prevent a runaway process. (Something like the guard loop sketched below.)
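
A sketch in Python of that guard loop (not Netscape's code; the limits are made-up numbers):

    import time

    MAX_RUNTIME_SECS = 8 * 3600   # hypothetical time-to-live: 8 hours
    MAX_TMP_BYTES = 200 * 2**20   # hypothetical scratch cap: 200 MB

    def run_robot(urls, fetch):
        """Enumerate urls, but die cleanly if the TTL or the tmp-file
        quota is exceeded, instead of taking the server or disk down."""
        start, tmp_used = time.time(), 0
        for url in urls:
            if time.time() - start > MAX_RUNTIME_SECS:
                print("TTL exceeded; robot shutting down")
                break
            if tmp_used > MAX_TMP_BYTES:
                print("tmp quota exceeded; robot shutting down")
                break
            tmp_used += fetch(url)  # fetch returns bytes written to tmp

    # toy run: each "fetch" pretends to write 1 MB of scratch data
    run_robot(["http://a", "http://b"], fetch=lambda u: 2**20)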
* The user interface for maintaining robot rules sucks. Sure it's Java, and that's the rage, but it's slow, it doesn't update correctly, the scroll bars don't work, and it floods virtual memory on an NT box. Totally unacceptable. Make it a CGI app or a Windows app. It must support templates, moving rules up/down, and copying rules. (The rules list doesn't update correctly unless you close the browser completely and reopen it.)
* Your documentation is horrible. It looks like it is about 2 revs behind the application. The terminology changes all over the place, and the programmer's guide is completely absent from the documentation. It points you to the Netscape site (which is difficult to get to on an isolated intranet, like MANY of your corporate customers have). And then the documentation is not even on your site. Very sloppy. How am I supposed to customize the product if I can't get the requisite docs?
* I want to be able to set temp-file usage maximums at configuration time. Catalog Server disk usage is out of control: 11,000 resources took over 40 hours and consumed 300 MB of disk space, 240 MB of that in temporary files (roughly 28 KB per resource, four-fifths of it scratch space). I went with all of the default settings, except that RDs were allowed 64K of RAM.
* I want to be able to allocate more than 64K of RAM per RD. This would let me build a big, fat, fast search engine for the entire company that would minimize search time and the drain on the network.

OTHER ISSUES:
=============
* Improve the Robot logs to indent differently for the different pieces of information. It makes them more readable.
* I want to be able to resume robot enumeration from the point where I stopped it. Right now I have no choice except to stop the robot entirely and restart it from scratch, which is a huge waste.
* Store multiple robot configuration templates per RDS. That lets me choose which one I want to use without having to set up another RDS on that machine.
* I want to be able to print your help files. The closed pseudo-browser thing you do is stupid. I can't find the files manually, I can't print them, I can't search them, and I can't find the URL of an outside link.
* I want to be able to do basic admin and status reporting from a browser, but I don't want it to HAVE to be Netscape 3.0. Netscape 1.1 compliance for basic robot stats, etc., would be huge.
* Set up email/pager notification if the robot experiences a problem, such as a disk full of tmp files. (Even the simple check sketched below would do.)
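
A sketch using Python's standard smtplib (the paths, addresses, and threshold are all made up):

    import os, smtplib
    from email.message import EmailMessage

    TMP_DIR = "/var/tmp/rds"       # hypothetical robot scratch directory
    MIN_FREE_BYTES = 100 * 2**20   # hypothetical alarm threshold: 100 MB

    def check_and_notify():
        """Mail the admin when the robot's tmp disk is nearly full."""
        stat = os.statvfs(TMP_DIR)
        free = stat.f_bavail * stat.f_frsize
        if free < MIN_FREE_BYTES:
            msg = EmailMessage()
            msg["Subject"] = "RDS robot: tmp disk nearly full"
            msg["From"] = "rds-robot@example.com"
            msg["To"] = "admin@example.com"
            msg.set_content("Only %d bytes free in %s" % (free, TMP_DIR))
            with smtplib.SMTP("localhost") as s:
                s.send_message(msg)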
* Include a client customization tool kit. I don't want our search engine to say "Netscape" all over it. I want a simple way to plug in my graphics, quick-links, "what's hot", etc., to provide real value to my users.
* Include a way so that when users log in to the catalog server, it remembers them and loads their default search criteria (view-by settings, order-by settings, etc.).
* Include better and more relevant schema examples.
* Add the ability to scan a resource's body text for organizing and gathering rules. This would make it much easier to discover patterns in my data and reverse-engineer an improved taxonomy.
* Filtering???? Your robot stats say "Filtered-at-...". However, filtered in or filtered out? Your docs do not indicate a default state for incoming resources, and the feedback does not indicate whether a given resource was filtered into the database or filtered out. Additionally, the statistics your product reports in the Robot Stats do not add up. What is really happening there?
* Your regex feature in filtering does not seem to behave as expected. Please include more examples in your documentation, along the lines of the sketch below.
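
For instance, here is the kind of worked example I'd want. These are ordinary regexes exercised in Python, since I can't tell from your docs how Catalog Server's own dialect differs (the hostnames are made up):

    import re

    # Hypothetical gathering filters: include intranet HTML pages,
    # but exclude anything in scratch areas.
    include = re.compile(r"^http://intranet\.example\.com/.*\.html$")
    exclude = re.compile(r"/(tmp|scratch)/")

    for url in ["http://intranet.example.com/hr/index.html",
                "http://intranet.example.com/tmp/junk.html"]:
        if include.match(url) and not exclude.search(url):
            print("gather:", url)
        else:
            print("skip:  ", url)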
* Add the ability to have multiple directories for tmp file usage.
* Improve your docs. They seem out of date, and the terminology changes all over the place: import item vs. import agent, registering resources vs. importing resources... Very, very sloppy. Also, include diagrams that clearly display the architecture of the system.
* Scheduling needs to be vastly improved. I want the ability to schedule importing multiple times per day. I also want to be able to set different schedule times for different days of the week. You can't do either now. (The sketch below shows the kind of schedule I have in mind.)
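
Expressing it takes a handful of lines; a sketch in Python (the times and days are made up):

    # Illustrative only: multiple import times per day, and different
    # times on different days of the week.
    schedule = {
        "Mon": ["02:00", "12:00", "18:00"],  # heavy weekday churn
        "Sat": ["04:00"],                    # one quiet weekend pass
    }

    def imports_due(day, now):
        """Return the import times for a day that fall at or before now."""
        return [t for t in schedule.get(day, []) if t <= now]

    print(imports_due("Mon", "13:00"))  # -> ['02:00', '12:00']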
* When pulling up the robot status, it should also show the last couple of times you pulled it up, so you can get a quick eyeball of progress.
* Have a way to show real-time, continuous progress of the robots. That would be very helpful for determining how to grow my RDS network and distribute servers. It would also look really cool!
* Your explanation of DNS translation is horrible. Again, you used about 5 different terms for the same example, completely confusing the issue. I tried the translation both ways, realname->displayname and displayname->realname, and neither works: Catalog Server always shows the wrong name as the result of the RDS robots. (The sketch below is the behavior I expected.)
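
For what it's worth, here is the behavior I expected, written out in Python since the docs never pin down which direction the mapping runs (the hostname is made up):

    # Hypothetical host-name translation: the robot crawls by real name,
    # and search results display the friendly name.
    real_to_display = {"web01.internal.example.com": "intranet"}

    def display_name(real_name):
        return real_to_display.get(real_name, real_name)

    print(display_name("web01.internal.example.com"))  # -> 'intranet'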
* Robot authentication is cool, but it doesn't go far enough. Give each robot the ability to store passwords for multiple resources. That way I can give it access to resources without having to tell all of my server administrators to create a new user/pwd combo for the robot. It could log on as me, using whatever login/pwd is relevant for the current resource. (Sketched below.)
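
A sketch of the idea in Python (URLs and logins made up): the robot keeps a credential table and picks the entry with the longest matching URL prefix:

    # Illustrative only: a per-resource credential store for the robot.
    credentials = {
        "http://hr.example.com/":       ("ekristoff", "pwd1"),
        "http://eng.example.com/docs/": ("robot",     "pwd2"),
    }

    def creds_for(url):
        """Login/password for the longest matching URL prefix, if any."""
        matches = [p for p in credentials if url.startswith(p)]
        return credentials[max(matches, key=len)] if matches else None

    print(creds_for("http://hr.example.com/benefits.html"))
    # -> ('ekristoff', 'pwd1')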
* I want an obvious, really easy way to TOTALLY PURGE all of the databases, in order to ensure that a test update is performing correctly.
* At end-user search time, I want the ability to control case sensitivity.
* Have the ability to pull up the raw text of the config files, etc., in the browser, from both the RDS and catalog admin.
* How do robot stats and the catalog db stats correlate? There is no obvious relationship. Again, the numbers don't add up.
* "Robot Report" is a stupid name, because that's not what it is. Call =
it "RDS Servers Seen Report." That would be accurate, at least.
* How do file names in the tmp dir and robot names correspond?
* In the catalog server, importing reports time in GMT; everywhere else it's reported in local time.
* There is a huge (almost 2x) discrepancy between the db size reported by your GUI tools and the actual size on disk. Explain this one.