Re: Indexing two-byte text

Paul Francis (francis@cactus.slab.ntt.jp)
Thu, 7 Dec 95 11:48:24 JST


>
> Is there publicly available code to handle stemming for Japanese, or is
> there a description of the algorithm involved anywhere (in English or in
> Japanese)?

Our Japanese "publisher" code will be made publicly
available after 1) it is in decent shape, and 2) we
get approval from management to release it (don't
worry, we *WILL* get approval, one way or another :-).

As for stemming: after making a weak attempt at finding
out what other people are doing, we couldn't find
anything about Japanese stemming. I think this may be
because, since a dictionary is necessary simply to
parse out the individual words, algorithmic stemming
isn't really necessary. The stems are already in the
dictionary.

I wanted to minimize dependence on a dictionary, though,
so we put our heads together and decided that effective
stemming for Japanese simply requires removing any kana
that appears after a kanji in a single "term". In other
words, the kanji is the stem, in all cases. If the term
has no kanji, then we don't stem at all.

Though surely this simple algorithm must break for some
cases, in our limited experience so far, we haven't found
any problems.
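
A minimal sketch of the rule described above, in Python.
It reads "removing any kana that appears after a kanji" as
dropping every kana character that follows the first kanji
in a term, which is only one possible reading. The function
names and the Unicode block ranges are assumptions made for
illustration; they are not taken from the publisher code.

```python
def is_kanji(ch):
    # Assumption: "kanji" means the CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def is_kana(ch):
    # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF).
    return "\u3040" <= ch <= "\u30ff"


def stem(term):
    """Drop kana that follow a kanji; leave kanji-free terms alone."""
    seen_kanji = False
    kept = []
    for ch in term:
        if is_kanji(ch):
            seen_kanji = True
            kept.append(ch)
        elif seen_kanji and is_kana(ch):
            continue  # kana after a kanji is treated as inflection
        else:
            kept.append(ch)
    return "".join(kept)
```

Under this reading, stem("食べる") gives "食", while a pure
katakana term such as "インターネット" is returned unchanged.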

>
> And what sort of garbage remains after using JUMAN?
>

JUMAN doesn't remove any text per se; it just tries to separate
out the individual terms. So, in general, its output still has
all kinds of junk in it that aren't valid terms, including
numbers and various symbols such as stars, circles, X's, etc.
So, we try to filter as much of that out as we can without
removing any valid stuff.
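
One way such a filter could look, as a rough sketch rather
than the filter actually used: keep a term only if it
contains at least one character that Unicode classifies as
a letter. The function names are hypothetical. Note that a
Latin "X" counts as a letter, so this version would not
catch X's used as symbols.

```python
import unicodedata


def looks_like_term(token):
    # Unicode general categories starting with "L" are letters;
    # kana and kanji fall under "Lo", digits under "Nd",
    # stars and circles under "So".
    return any(unicodedata.category(ch).startswith("L") for ch in token)


def filter_terms(tokens):
    """Drop tokens made up only of digits, symbols, or punctuation."""
    return [t for t in tokens if looks_like_term(t)]
```

For example, filter_terms(["東京", "★", "123", "○"]) keeps
only "東京".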

As for JUMAN's term isolation ability, it suffers from a
small dictionary. For example "intaanetto" (in romaji,
"internet" in English) is broken into "intaa" and "netto",
because JUMAN doesn't have "intaanetto" in its dictionary.
I believe we'll be able to fix most of these by doing
simple phrase detection. That is, if we see that "intaa"
is always or very often followed by "netto", we can assume
that they constitute a single phrase (or, in the no-white-space
case, a single term). We will implement phrase detection
next, and expect to have it by late January.
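
The phrase detection idea can be sketched as follows. This
is an illustration under stated assumptions, not the planned
implementation: the function names, the min_count and
min_ratio thresholds, and the representation of input as
lists of already-separated terms are all made up for the
example.

```python
from collections import Counter


def find_phrases(docs, min_count=2, min_ratio=0.8):
    """Return (a, b) pairs where a is followed by b in at least
    min_ratio of a's occurrences, and at least min_count times."""
    term_counts = Counter()
    pair_counts = Counter()
    for terms in docs:
        term_counts.update(terms)
        pair_counts.update(zip(terms, terms[1:]))
    return {
        (a, b)
        for (a, b), n in pair_counts.items()
        if n >= min_count and n / term_counts[a] >= min_ratio
    }


def merge_phrases(terms, phrases):
    """Join adjacent terms that form a detected phrase."""
    merged = []
    i = 0
    while i < len(terms):
        if i + 1 < len(terms) and (terms[i], terms[i + 1]) in phrases:
            merged.append(terms[i] + terms[i + 1])
            i += 2
        else:
            merged.append(terms[i])
            i += 1
    return merged
```

With a corpus in which "intaa" is always followed by
"netto", find_phrases reports that pair, and merge_phrases
then rejoins the two pieces into "intaanetto".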

PF

ps. By the way, our Japanese publisher will be a single
component of a multi-lingual publisher that will have
language detection built in. We are doing Japanese and
English, but expect to add others as they are done.

pps. I really don't think this thread is so interesting
to the robot list people. Maybe we should take it off-line.