Re: Indexing two-byte text

Frank Smadja (smadja@netvision.net.il)
Thu, 07 Dec 1995 09:27:25 -0500


I am interested in this thread. Please keep it online or keep me posted.

Thanks

At 11:48 AM 12/7/95 JST, you wrote:
>>
>> Is there publicly available code to handle stemming for Japanese, or is
>> there a description of the algorithm involved anywhere (in English or in
>> Japanese)?
>
>Our Japanese "publisher" code will be made publicly
>available after 1) it is in decent shape, and 2) we
>get approval from management to release it (don't
>worry, we *WILL* get approval, one way or another :-).
>
>As for stemming: after making a weak attempt at finding
>out what other people are doing, we couldn't find
>anything about Japanese stemming. I think this may be
>because, since a dictionary is necessary simply to
>parse out the individual words, algorithmic stemming
>isn't really necessary. The stems are already in the
>dictionary.
>
>I wanted to minimize dependence on a dictionary, though,
>so we put our heads together and decided that effective
>stemming for Japanese simply requires removing any kana
>that appears after a kanji in a single "term". In other
>words, the kanji is the stem, in all cases. If the term
>has no kanji, then we don't stem at all.
>
>Though surely this simple algorithm must break for some
>cases, in our limited experience so far, we haven't found
>any problems.
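
The rule quoted above can be sketched in a few lines. This is only an
illustration, not the publisher code being discussed: the function name
stem() and the Unicode ranges are assumptions, and it reads "kana that
appears after a kanji" as "the run of kana at the end of the term". A
term that ends in a kanji, or has no kanji at all, is returned unchanged.

```python
import re

# Assumed Unicode ranges (not from the quoted post):
#   hiragana + katakana: U+3040-U+30FF, CJK unified ideographs: U+4E00-U+9FFF
_KANJI = "\u4e00-\u9fff"
_KANA = "\u3040-\u30ff"

# A run of kana at the end of the term, immediately preceded by a kanji.
_TRAILING_KANA = re.compile(f"(?<=[{_KANJI}])[{_KANA}]+$")


def stem(term: str) -> str:
    """Strip the trailing kana that follow a kanji; leave other terms alone."""
    return _TRAILING_KANA.sub("", term)


if __name__ == "__main__":
    for term in ["食べる", "書きます", "インターネット", "食べ物"]:
        print(term, "->", stem(term))
```

Kana that sit between two kanji (as in the last example) are left in
place under this reading; whether the original rule would drop them too
is not clear from the description.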
>
>>
>> And what sort of garbage remains after using JUMAN?
>>
>
>JUMAN doesn't remove any text per se; it just tries to separate
>out the individual terms. So its output still contains all
>kinds of junk that aren't valid terms, including numbers and
>various symbols such as stars, circles, X's, etc. We try to
>filter out as much of that as we can without removing
>anything valid.
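
One conservative way to do this kind of filtering, sketched under my own
assumptions (the names is_candidate_term() and filter_terms() are
hypothetical, and the actual filter described above may work quite
differently): keep a token only if it contains at least one letter in
the Unicode sense, which covers kana, kanji and Latin letters but not
digits or symbols.

```python
def is_candidate_term(token: str) -> bool:
    """True if the token contains at least one Unicode letter.

    Assumption: anything made up purely of digits, punctuation or symbols
    (stars, circles, multiplication signs, ...) is junk.
    """
    return any(ch.isalpha() for ch in token)


def filter_terms(tokens):
    """Drop tokens that contain no letters at all."""
    return [t for t in tokens if is_candidate_term(t)]


if __name__ == "__main__":
    print(filter_terms(["東京", "★", "123", "○", "インターネット", "×"]))
```

A Latin letter "X" used as a mark would still pass this test, so a small
stoplist would be needed on top of it. Mixed tokens such as "Windows95"
are kept on purpose, to avoid throwing away valid terms.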
>
>As for JUMAN's term isolation ability, it suffers from a
>small dictionary. For example "intaanetto" (in romaji,
>"internet" in English) is broken into "intaa" and "netto",
>because JUMAN doesn't have "intaanetto" in its dictionary.
>I believe we'll be able to fix most of these by doing
>simple phrase detection. That is, if we see that "intaa"
>is always or very often followed by "netto", we can assume
>that they constitute a single phrase (or, in the no-white-space
>case, a single term). We will implement phrase detection
>next, and expect to have it by late January.
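
The "always or very often followed by" test can be sketched with plain
counts. This is an illustration under assumptions, not the planned
implementation: detect_phrases(), merge_phrases() and the thresholds
min_ratio and min_count are all hypothetical. A pair (a, b) is treated as
a phrase when it occurs at least min_count times and accounts for at
least min_ratio of all occurrences of a.

```python
from collections import Counter


def detect_phrases(docs, min_ratio=0.9, min_count=3):
    """Return the set of (a, b) pairs where a is nearly always followed by b.

    docs is an iterable of token lists, e.g. the segmenter output per document.
    """
    token_counts = Counter()
    pair_counts = Counter()
    for tokens in docs:
        token_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))
    return {
        pair
        for pair, n in pair_counts.items()
        if n >= min_count and n / token_counts[pair[0]] >= min_ratio
    }


def merge_phrases(tokens, phrases):
    """Join adjacent tokens that form a detected phrase into a single term."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    docs = [
        ["intaa", "netto", "wa"],
        ["intaa", "netto"],
        ["intaa", "netto", "no"],
        ["netto", "wa"],
    ]
    phrases = detect_phrases(docs)
    print(merge_phrases(["intaa", "netto", "wa"], phrases))
```

The ratio is taken only in one direction ("intaa" is followed by
"netto"), matching the description above; "netto" may still appear on
its own elsewhere.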
>
>PF
>
>ps. By the way, our Japanese publisher will be a single
>component of a multi-lingual publisher that will have
>language detection built in. We are doing Japanese and
>English, but expect to add others as they are done.
>
>pps. I really don't think this thread is so interesting
>to the robot list people. Maybe we should take it off-line.
>
>