I've just spent a few hours searching with AltaVista for information,
which, incidentally, I found. But I'm surprised by the growing number
of documents that I can't understand, simply because they're written
in a foreign language (foreign to me, that is, neither French nor
English), not to mention non-ISO-8859 files, such as Japanese ones.
The documents put on the Web used to be written by researchers, for
whom English is mandatory, but they are likely to be outnumbered by
texts created by ordinary people, neither researchers nor computer
professionals, who now make up most of the users of the Internet and
the Web. This is a great thing for sure, but the curse of the Tower
of Babel is still on us, and a not-so-great effect is that the
documents one can understand get diluted among the results when
performing a search with an indexer.
A simple solution: tagging the file with its language. For example,
using an HTTP-EQUIV meta and an ISO 639 code, we get something like
<META HTTP-EQUIV="Language" CONTENT="en"> for English. Of course, this
is useful only if 1) the indexers let the user restrict a search to a
given set of languages and 2) many people do it.
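To make the tagging idea concrete, here is a minimal sketch of how an
indexer might read such a tag and filter on it. It uses Python's
standard html.parser module. The names LanguageMetaParser,
declared_languages and matches_reader are hypothetical, and accepting a
comma-separated list of codes in CONTENT is my own assumption, not
something any standard defines.

```python
from html.parser import HTMLParser


class LanguageMetaParser(HTMLParser):
    """Collects the CONTENT of <META HTTP-EQUIV="Language" ...> tags."""

    def __init__(self):
        super().__init__()
        self.languages = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "language":
            return
        content = attributes.get("content") or ""
        # Assumption: several codes may be listed, separated by commas.
        for code in content.split(","):
            code = code.strip().lower()
            if code:
                self.languages.append(code)


def declared_languages(html_text):
    """Return the language codes a document declares (possibly none)."""
    parser = LanguageMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.languages


def matches_reader(html_text, wanted):
    """True if the document declares at least one language in `wanted`."""
    return any(code in wanted for code in declared_languages(html_text))


page = '<HTML><HEAD><META HTTP-EQUIV="Language" CONTENT="en"></HEAD></HTML>'
print(declared_languages(page))
print(matches_reader(page, {"en", "fr"}))
```

An untagged document declares nothing, so matches_reader rejects it.
Whether an indexer should hide untagged pages or show them anyway is a
policy choice this sketch does not settle.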
A more interesting approach is for the indexer to try to figure out
the language of the document, perhaps based on a statistical analysis.
Problems will probably arise with mixed-language files.
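As a rough illustration of what such a statistical guess could look
like, the sketch below scores a text by the share of its words that are
common function words of each language. This is only one possible
technique, chosen for brevity. The word lists are tiny and purely
illustrative, and the names FUNCTION_WORDS, language_scores and
guess_language, as well as the 0.1 threshold, are assumptions of mine.

```python
import re

# Tiny illustrative word lists; a real indexer would derive much larger
# ones (or character statistics) from sample texts in each language.
FUNCTION_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for", "with", "this"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une", "que",
           "dans", "pour"},
}


def language_scores(text):
    """Return, per language, the fraction of words that are function words."""
    # Runs of letters only (accented letters included, digits excluded).
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return {}
    return {
        lang: sum(word in vocabulary for word in words) / len(words)
        for lang, vocabulary in FUNCTION_WORDS.items()
    }


def guess_language(text, threshold=0.1):
    """Return the best-scoring language code, or None if nothing is convincing."""
    scores = language_scores(text)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


print(guess_language("The language of the document is given in this tag"))
print(guess_language("Le document est dans la langue de la page"))
```

For a mixed-language file, language_scores could be applied paragraph
by paragraph instead of to the whole text, so that each part gets its
own guess. Very short texts will remain unreliable with any method of
this kind.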
What do you think of this? Has anyone already done it?
+--------------------------+------------------------------------+
| | |
| Christophe TRONCHE | E-mail : tronche@lri.fr |
| | |
| +-=-+-=-+ | Phone : 33 - 1 - 69 41 66 25 |
| | Fax : 33 - 1 - 69 41 65 86 |
+--------------------------+------------------------------------+
| ###### ** |
| ## # Laboratoire de Recherche en Informatique |
| ## # ## Batiment 490 |
| ## # ## Universite de Paris-Sud |
| ## #### ## 91405 ORSAY CEDEX |
| ###### ## ## FRANCE |
|###### ### |
+---------------------------------------------------------------+