Indexing two-byte text

Harry Munir Behrens (behrens@mtl.t.u-tokyo.ac.jp)
Fri, 08 Dec 1995 00:26:27 +0900


Hi guys,

Terrific response, thanks to all who were interested and helpful.
I have asked around some more in the university circus
and we have arrived at the following project plan:

We are putting in place a three-phase system based on JUMAN
(for now) and an existing dictionary-based, rule-based system.
In the first phase the system scans the text looking for
two- and four-kanji components that the dictionary knows.
These are singled out as "sure hits" and are stemmed where appropriate.
In the second phase we run JUMAN over the resulting text.
The third phase is going to be very similar to the first, but
will be for verification purposes only: if JUMAN generates
terms the dictionary doesn't know about, error messages are output.

The fourth stage is manual editing of these error messages :-(
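
For anyone who wants to picture the flow, here is a rough sketch of the
three automatic phases in Python. It is only an illustration under
assumptions, not our actual code: the function names, the toy dictionary
and the error-message format are made up, stemming is left out, and JUMAN
is represented by a stand-in callable (`segment`) rather than a real call
to it. How the phase-one result is handed to JUMAN is also an assumption;
here the segmenter simply gets the original text.

```python
# Hypothetical sketch of the three-phase plan; all names are illustrative.

def is_kanji(ch):
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def find_sure_hits(text, dictionary, lengths=(4, 2)):
    """Phase 1: scan for four- and two-kanji components the dictionary knows.

    Returns (offset, component) pairs. Longer matches are tried first.
    """
    hits = []
    i = 0
    while i < len(text):
        for n in lengths:
            chunk = text[i:i + n]
            if len(chunk) == n and all(map(is_kanji, chunk)) and chunk in dictionary:
                hits.append((i, chunk))
                i += n
                break
        else:
            i += 1  # no known component starts here; move on one character
    return hits


def verify(terms, dictionary):
    """Phase 3: report every generated term the dictionary does not know."""
    return [f"unknown term: {t}" for t in terms if t not in dictionary]


def index_text(text, dictionary, segment):
    """Run all three phases; `segment` stands in for the JUMAN run (phase 2)."""
    hits = find_sure_hits(text, dictionary)
    terms = segment(text)
    errors = verify(terms, dictionary)
    return hits, terms, errors


if __name__ == "__main__":
    toy_dictionary = {"東京", "大学", "東京大学", "の"}
    # Toy stand-in for the segmenter: returns a fixed segmentation.
    toy_segment = lambda text: ["東京大学", "の", "研究"]
    hits, terms, errors = index_text("東京大学の研究", toy_dictionary, toy_segment)
    print(hits)
    print(errors)
```

The list returned by `verify` corresponds to the error messages that then
go to manual editing.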

If there's anybody out there who is interested in more detailed info,
please get in touch at behrens@mtl.t.u-tokyo.ac.jp
I'd be happy to receive any comments, suggestions, etc.

Harry Behrens
Ph.D. candidate
Dept. of Electrical Engineering
Univ. of Tokyo
behrens@mtl.t.u-tokyo.ac.jp