Re: Indexing two-byte text

Mark Schrimsher (mschrimsher@twics.com)
Thu, 7 Dec 1995 11:18:57 +0900


At 4:46 PM 12/6/95, Paul Francis wrote:
>We are doing a multi-lingual navigation project
>(called Ingrid) that involves indexing Japanese
>text. We use JUMAN to extract Japanese text
>(because it is public domain---it actually doesn't
>do such a good job), and some home-grown Perl
>stuff to filter out garbage, weight terms, and
>do stemming.

Is there publicly available code to handle stemming for Japanese, or is
there a description of the algorithm involved anywhere (in English or in
Japanese)?

And what sort of garbage remains after using JUMAN?

--Mark