Indexing two-byte text

Mark Schrimsher (mschrimsher@twics.com)
Fri, 8 Dec 1995 16:51:06 +0900


francis@cactus.slab.ntt.jp (Paul Francis) wrote:

>Our Japanese "publisher" code will be made publicly
>available after 1) it is in decent shape, and 2) we
>get approval from management to release it (don't
>worry, we *WILL* get approval, one way or another :-).

You may get approval, but I assume the code still couldn't be used freely
for commercial purposes?

>As for stemming. After making a weak attempt at finding
>out what other people are doing, we couldn't find
>anything about Japanese stemming. I think this may be
>because, since a dictionary is necessary simply to
>parse out the individual words, algorithmic stemming
>isn't really necessary. The stems are already in the
>dictionary.

>I wanted to minimize dependence on a dictionary, though,
>so we put our heads together and decided that effective
>stemming for Japanese simply requires removing any kana
>that appears after a kanji in a single "term". In other
>words, the kanji is the stem, in all cases. If the term
>has no kanji, then we don't stem at all.
>
>Though surely this simple algorithm must break for some
>cases, in our limited experience so far, we haven't found
>any problems.
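
To make the rule concrete, here is a minimal sketch of how I read it. This
is not the actual NTT code. It assumes the term has already been isolated
and decoded to Unicode (rather than a two-byte encoding such as EUC-JP),
and it takes "kana after a kanji" to mean the kana trailing the last kanji
in the term. The function names and the character ranges are my own
choices for illustration.

```python
def is_kanji(ch):
    # CJK Unified Ideographs block (an assumed, simplified definition of "kanji").
    return "\u4e00" <= ch <= "\u9fff"


def is_kana(ch):
    # Hiragana and katakana blocks, including the long-vowel mark.
    return "\u3040" <= ch <= "\u30ff"


def stem(term):
    """Strip kana that trail the last kanji in a term.

    Terms with no kanji are returned unchanged, as are terms whose
    tail contains anything other than kana.
    """
    last_kanji = -1
    for i, ch in enumerate(term):
        if is_kanji(ch):
            last_kanji = i
    if last_kanji == -1:
        return term  # no kanji: do not stem at all
    tail = term[last_kanji + 1:]
    if all(is_kana(ch) for ch in tail):
        return term[:last_kanji + 1]
    return term
```

Under these assumptions an inflected verb is cut back to its kanji, while
an all-katakana loanword is left alone. Kana that sit between two kanji
are kept, which may or may not be what the original rule intends.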

I don't think perfection is necessary here to produce a useful system
anyway. But couldn't you just swap in a better dictionary? I just got a
copy of JUMAN, and although I have only glanced at the files, the
dictionary seems to be split up by part of speech. Most new coinages in a
language tend to be nouns, I would think. This could be a business
opportunity for someone: just as software companies in the U.S. buy their
spell checkers from specialized companies, someone could develop and
market a morphological root dictionary for Japanese.

>As for JUMAN's term isolation ability, it suffers from a
>small dictionary. For example "intaanetto" (in romaji,
>"internet" in English) is broken into "intaa" and "netto",
>because JUMAN doesn't have "intaanetto" in its dictionary.
>I believe we'll be able to fix most of these by doing
>simple phrase detection. That is, if we see that "intaa"
>is always or very often followed by "netto", we can assume
>that they constitute a single phrase (or, in the no-white-space
>case, a single term). We will implement phrase detection
>next, and expect to have it by late January.

Ha! A programmer's solution. It seems like just expanding the dictionary
would be more straightforward. ;-)
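
For reference, the phrase detection described in the quote could be
sketched roughly as below. This is only my guess at the idea, not the
planned implementation: it assumes each document is already a list of
terms, and the names detect_phrases and merge_phrases and the thresholds
min_count and min_ratio are made up for illustration.

```python
from collections import Counter


def detect_phrases(docs, min_count=2, min_ratio=0.8):
    """Return the set of (first, second) term pairs to treat as one phrase.

    A pair qualifies when it occurs at least min_count times and the
    first term is followed by the second in at least min_ratio of the
    first term's occurrences ("always or very often").
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in docs:
        term_counts.update(terms)
        pair_counts.update(zip(terms, terms[1:]))
    return {
        pair
        for pair, n in pair_counts.items()
        if n >= min_count and n / term_counts[pair[0]] >= min_ratio
    }


def merge_phrases(terms, phrases):
    """Join adjacent terms that form a detected phrase into a single term."""
    merged = []
    i = 0
    while i < len(terms):
        if i + 1 < len(terms) and (terms[i], terms[i + 1]) in phrases:
            merged.append(terms[i] + terms[i + 1])
            i += 2
        else:
            merged.append(terms[i])
            i += 1
    return merged
```

With a corpus in which "intaa" is nearly always followed by "netto", the
pair would be detected and merge_phrases would emit "intaanetto" as one
term. The thresholds would need tuning on real data, and a pair seen only
a handful of times proves little either way.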

>ps. By the way, our Japanese publisher will be a single
>component of a multi-lingual publisher that will have
>language detection built in. We are doing Japanese and
>English, but expect to add others as they are done.

I'm not sure what you mean by a "publisher" or what it does. Is it
different from Ingrid?

--Mark