Re: Indexing two-byte text

Paul Francis (francis@cactus.slab.ntt.jp)
Wed, 6 Dec 95 16:46:55 JST


>
> here at the Univ. of Tokyo we are currently installing Harvest and were
> wondering if anybody has experience with the problems encountered
> when indexing Japanese text. (no word boundaries, two-byte code etc.)
> I would be very grateful for any help pointing me to an international
> version of agrep/glimpse or something similar.
>

We are doing a multi-lingual navigation project
(called Ingrid) that involves indexing Japanese
text. We use JUMAN to extract terms from the Japanese
text (because it is public domain, although it actually
doesn't do such a good job), and some home-grown Perl
scripts to filter out garbage, weight terms, and
do stemming.
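
To give a rough idea of the filtering and weighting step, here
is a minimal sketch in Python (not our actual Perl code). It
assumes the text has already been split into terms by a
segmenter such as JUMAN. The garbage rule and the relative
frequency weighting are assumptions made up for illustration,
the names is_garbage and weight_terms are hypothetical, and
stemming is left out.

```python
from collections import Counter
import unicodedata


def is_garbage(term):
    """Hypothetical rule: a term is garbage if it is empty or consists only of
    punctuation, symbols, digits, separators, or control characters."""
    return all(unicodedata.category(ch)[0] in "PSNZC" for ch in term)


def weight_terms(terms):
    """Drop garbage terms, then weight each remaining term by its relative
    frequency (an assumed, very simple weighting scheme)."""
    kept = [t for t in terms if not is_garbage(t)]
    counts = Counter(kept)
    total = len(kept)
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}
```

For example, weight_terms(["検索", "、", "検索", "1995", "索引"])
keeps only the two real words and gives the more frequent one
the higher weight.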

For searching, however, we are for now doing exact
string matching only.
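
One thing to watch with exact matching on two-byte codes: a
plain byte-level search can report a hit that starts in the
middle of one character and ends in the middle of the next.
A minimal sketch of one way around this, in Python and assuming
the text is EUC-JP encoded, is to decode first and match on
characters. The function name find_aligned is hypothetical, and
this is not necessarily how our own matcher works.

```python
def find_aligned(text_bytes, query, encoding="euc_jp"):
    """Return the character offsets at which `query` occurs in the decoded
    text. Matching on decoded characters means a hit can never straddle
    two two-byte characters."""
    if not query:
        return []
    text = text_bytes.decode(encoding)
    hits = []
    pos = text.find(query)
    while pos != -1:
        hits.append(pos)
        pos = text.find(query, pos + 1)  # allow overlapping matches
    return hits


if __name__ == "__main__":
    text = "あい".encode("euc_jp")  # four bytes: A4 A2 A4 A4
    stray = text[1:3]                # second byte of あ plus first byte of い
    # Byte-level search reports a false hit at byte offset 1.
    print(text.find(stray))
    # Character-level search correctly finds no occurrence.
    print(find_aligned(text, stray.decode("euc_jp")))
```

The cost is decoding the whole text before searching; working
on raw bytes and checking that each hit starts on a character
boundary would be another option.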

I suggest you ask this question on the
comp.infosystems.harvest newsgroup and also on the
winter (web internationalization) mailing list
at winter@dorado.crpht.lu. (Please see
http://dorado.crpht.lu:80/~carrasco/winter/
for the winter web page.)

I think there may be some Mule tools for international
grep-like things, but I'm not absolutely sure
about it...

PF