...
> that incidentally I found. But I'm surprised by the increasing number
> of documents that I can't understand, simply because they're written
> in a foreign language (foreign to me, that is, neither French nor English),
> not to speak of non-ISO-8859 files, such as Japanese ones.
...
> A more interesting approach is for the indexer to try to figure out the
> language of the document, based perhaps on a statistical analysis.
> Problems will probably arise with mixed-language files.
An easy way to tell might be by examination of stopwords. If a document
has lots of words like "an", "to", "be", "by", "of", "if", "a", "the",
"in", "this", "then", "it", "at" and "some" then it probably contains at
least some English. "Le", "la", "les", "un", "une", "en", "au", "de",
"des" point to French.
The advantage is that you would need a relatively small number of words
for each language, not the whole dictionary.
Of course, this approach might not separate very closely related
languages.
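A rough sketch of the idea in Python, just to make it concrete. The two
word lists are the ones given above; the function names, the tokenizing
regex and the 0.1 threshold are my own assumptions for illustration, not
tuned values. A real indexer would want longer lists and some testing
against real documents.

```python
import re
from collections import Counter

# Stopword lists taken from the message above; real lists would be longer.
STOPWORDS = {
    "english": {"an", "to", "be", "by", "of", "if", "a", "the",
                "in", "this", "then", "it", "at", "some"},
    "french": {"le", "la", "les", "un", "une", "en", "au", "de", "des"},
}


def stopword_scores(text):
    """Return, per language, the fraction of words in text that are stopwords."""
    # Runs of letters only; digits, underscores and punctuation split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return {lang: 0.0 for lang in STOPWORDS}
    counts = Counter(words)
    return {
        lang: sum(counts[w] for w in stopwords) / len(words)
        for lang, stopwords in STOPWORDS.items()
    }


def guess_languages(text, threshold=0.1):
    """List every language whose stopword fraction reaches the threshold.

    The threshold is an assumed cut-off, not a measured one.
    """
    scores = stopword_scores(text)
    return sorted(lang for lang, score in scores.items() if score >= threshold)
```

Because every language is scored independently, a mixed-language file
simply comes back with more than one language in the list, and a file
with no known stopwords comes back with none.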
-Fritz