Suppose we have a text file with the content: "Je suis un beau homme ..."

another with: "I am a brave man"

the third with a text in German: "Guten morgen. Wie geht's ?"

How do we write a function that would tell us, with some probability, that the text in the first file is in French, the text in the second in English, and so on?

Links to books / out-of-the-box solutions are welcome. I write in Java, but I can learn Python if needed.

My comments

  1. There's one small comment I need to add. The text may contain phrases in different languages, either deliberately or by mistake. In classic literature we have a lot of examples, because members of the aristocracy were multilingual. So a probability describes the situation better: most of the text is in one language, while parts of it may be written in another.
  2. Google API - Internet connection. I would prefer not to use remote functions/services, as I need to do it myself or use a downloadable library. I'd like to do my own research on that topic.
A: 

You probably need to do something with Frequency Analysis

http://en.wikipedia.org/wiki/Letter_frequencies
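
To illustrate the idea, here is a minimal Java sketch (my own code, not from any library) that picks the language whose reference letter-frequency vector is closest to the text's. The reference vectors are not included; you would fill them in from tables such as the Wikipedia page above. Note it ignores accented letters, which already carry a lot of signal.

import java.util.Map;

public class LetterFrequencyGuesser {

    // Normalized frequencies of 'a'..'z' in the text (accents ignored).
    public static double[] frequencies(String text) {
        double[] counts = new double[26];
        int total = 0;
        for (char c : text.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < 26; i++) {
                counts[i] /= total;
            }
        }
        return counts;
    }

    // Pick the reference vector with the smallest squared-error distance.
    public static String guess(String text, Map<String, double[]> reference) {
        double[] observed = frequencies(text);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> entry : reference.entrySet()) {
            double distance = 0;
            for (int i = 0; i < 26; i++) {
                double diff = observed[i] - entry.getValue()[i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }
}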

Bas
Letter frequency is a step in the right direction, but not enough. As an example, in Italian you will never find a word with "aa" inside, or "cha" (while both "ch" and "ha" are fairly common). This is why I suggested Markov chains: they basically model how probable it is to find one letter after another (or after a longer string). So "a followed by another a" will be at 0% in Italian, while it may be common in Dutch...
p.marino
+4  A: 

Language detection by Google: http://code.google.com/apis/ajaxlanguage/documentation/#Detect

cherouvim
A: 

Do you have a connection to the internet? If you do, then the Google Language API would be perfect for you.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import org.json.JSONObject; // from the org.json library

// This example request includes an optional API key which you will need to
// remove or replace with your own key.
// The request also includes the userip parameter, which provides the end
// user's IP address. Doing so will help distinguish this legitimate
// server-side traffic from traffic which doesn't come from an end user.
URL url = new URL(
    "http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&"
    + "q=Je%20suis%20un%20beau%20homme&key=INSERT-YOUR-KEY&userip=USERS-IP-ADDRESS");
URLConnection connection = url.openConnection();
// The Referer header must be an actual string; enter the URL of your site.
connection.addRequestProperty("Referer", "http://www.example.com/");

String line;
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(
    new InputStreamReader(connection.getInputStream()));
while ((line = reader.readLine()) != null) {
    builder.append(line);
}
reader.close();

JSONObject json = new JSONObject(builder.toString());
// The detected language code is under responseData.language.

If you don't, there are other methods.

Laykes
+15  A: 

There is a package called JLangDetect which seems to do exactly what you want:

langof("un texte en français") = fr : OK
langof("a text in english") = en : OK
langof("un texto en español") = es : OK
langof("un texte un peu plus long en français") = fr : OK
langof("a text a little longer in english") = en : OK
langof("a little longer text in english") = en : OK
langof("un texto un poco mas largo en español") = es : OK
langof("J'aime les bisounours !") = fr : OK
langof("Bienvenue à Montmartre !") = fr : OK
langof("Welcome to London !") = en : OK
// ...

Edit: as Kevin pointed out, there is similar functionality in the Nutch project provided by the package org.apache.nutch.analysis.lang.

Otto Allmendinger
Why is there no German example?
Chris
@Chris Well, a good question. I only know one phrase; let's see if I can write it correctly.
EugeneP
I don't know, but German is listed as a supported language
Otto Allmendinger
@EugeneP: try `Dies ist ein kurzer deutscher Text`
Otto Allmendinger
Hi there! Just for the sake of completeness, "longer" is spelled "largo" in Spanish :)
egarcia
JLangDetect is less accurate than Nutch, especially on shorter text.
Kevin Peterson
@Kevin: The only API I found for language detection with Nutch is based on http, which isn't an option as the asker stated
Otto Allmendinger
org.apache.nutch.analysis.lang package is what we use
Kevin Peterson
+3  A: 

Look up Markov chains.

Basically you will need statistically significant samples of the languages you want to recognize. When you get a new file, see what the frequencies of specific syllables or phonemes are, and compare them with the pre-calculated samples. Pick the closest one.
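
A toy Java sketch of that idea, using character bigrams as the units (the class and method names are mine, purely illustrative): train on one large sample per language, then score new text by smoothed bigram log-probabilities and pick the best-scoring language.

import java.util.HashMap;
import java.util.Map;

// Toy Markov-chain (character-bigram) guesser. Real use needs large,
// statistically significant training samples per language.
public class BigramGuesser {
    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    public void train(String language, String sample) {
        Map<String, Integer> counts =
                bigramCounts.computeIfAbsent(language, k -> new HashMap<>());
        String s = sample.toLowerCase();
        for (int i = 0; i < s.length() - 1; i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
        }
    }

    // Sum of add-one-smoothed log probabilities; the highest total wins.
    public String guess(String text) {
        String s = text.toLowerCase();
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : bigramCounts.entrySet()) {
            Map<String, Integer> counts = e.getValue();
            int total = totals.get(e.getKey());
            double score = 0;
            for (int i = 0; i < s.length() - 1; i++) {
                int c = counts.getOrDefault(s.substring(i, i + 2), 0);
                // crude smoothing; a real implementation would use the full
                // bigram vocabulary size in the denominator
                score += Math.log((c + 1.0) / (total + counts.size() + 1.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}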

p.marino
+5  A: 

For larger corpora of text you usually use the distribution of letters, digraphs and even trigraphs, and compare it with the known distributions for the languages you want to detect.

However, a single sentence is very likely too short to yield any useful statistical measures. In that case, you may have more luck matching individual words against a dictionary.
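
A minimal sketch of the dictionary route (my own illustration; the word sets would come from real dictionaries or word lists): count how many of the sentence's words appear in each language's set and pick the language with the most hits.

import java.util.Map;
import java.util.Set;

// Pick the language whose dictionary covers the most words of the sentence.
public class DictionaryGuesser {
    public static String guess(String sentence, Map<String, Set<String>> dictionaries) {
        String[] words = sentence.toLowerCase().split("[^\\p{L}]+");
        String best = null;
        int bestHits = -1;
        for (Map.Entry<String, Set<String>> entry : dictionaries.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}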

Joey
+2  A: 

Although a more complicated solution than you are looking for, you could use Vowpal Wabbit and train it with sentences from different languages.

In theory you could get back a language for every sentence in your documents.

http://hunch.net/~vw/

(Don't be fooled by the "online" in the project's subtitle - that's just mathspeak for learning without having to hold the whole training material in memory.)

Daniel Von Fange
Thank you for your answer.
EugeneP
+2  A: 

NGramJ seems to be a bit more up-to-date:

http://ngramj.sourceforge.net/

It also has both character-oriented and byte-oriented profiles, so it should be able to identify the character set too.

For documents in multiple languages you need to identify the character set (ICU4J has a CharsetDetector that can do this), then split the text on something reasonable like multiple line breaks, or paragraphs if the text is marked up.
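
For the charset step, ICU4J's CharsetDetector is used roughly like this (a sketch; exceptions and edge cases omitted):

import java.nio.file.Files;
import java.nio.file.Paths;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// Detect the charset of the raw bytes, then decode and split the text.
byte[] raw = Files.readAllBytes(Paths.get("input.txt"));
CharsetDetector detector = new CharsetDetector();
detector.setText(raw);
CharsetMatch match = detector.detect();
System.out.println("Charset: " + match.getName()
        + " (confidence " + match.getConfidence() + ")");
String text = match.getString();
for (String block : text.split("\\n{2,}")) {
    // run language identification on each paragraph-sized block
}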

Andrew Duffy
Thank you for your answer.
EugeneP
+2  A: 

Try Nutch's Language Identifier. It is trained with n-gram profiles of languages, and the profile of each available language is matched against the input text. The interesting thing is that you can add more languages if you need to.
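
For reference, usage looks roughly like the sketch below. The exact constructor and method names have shifted between Nutch versions, so treat this as an assumption to check against your version, not the definitive API.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;
import org.apache.nutch.util.NutchConfiguration;

// Sketch only: API details vary by Nutch version.
Configuration conf = NutchConfiguration.create();
LanguageIdentifier identifier = new LanguageIdentifier(conf);
String lang = identifier.identify("Guten Morgen. Wie geht's?"); // e.g. "de"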

Shashikant Kore
We use Nutch's language identifier with very good results. It's a standard implementation of a bigram model that works for languages sharing a character set.
Kevin Peterson
+1  A: 

If you are interested in the mechanism by which language detection can be performed, I refer you to the following article (Python-based) that uses a (very) naive method but is a good introduction to this problem in particular and to machine learning (just a big word) in general.

For Java implementations, JLangDetect and Nutch, as suggested by the other posters, are pretty good. Also take a look at LingPipe, JTCL and NGramJ.


For the problem where you have multiple languages in the same page, you can use a sentence boundary detector to chop a page into sentences and then attempt to identify the language of each sentence. Assuming that a sentence contains only one (primary) language, you should still get good results with any of the above implementations.

Note: A sentence boundary detector (SBD) is theoretically language-specific (a chicken-and-egg problem, since you need one to build the other). But for Latin-script-based languages (English, French, German, etc.) that primarily use periods (apart from exclamations etc.) for sentence delimiting, you will get acceptable results even if you use an SBD designed for English. I wrote a rules-based English SBD that has worked really well for French text. For implementations, take a look at OpenNLP.
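
If you don't want a full NLP toolkit, the JDK's built-in BreakIterator gives a rough-and-ready sentence splitter (a sketch; it is much cruder than OpenNLP's trained models, but often good enough for period-delimited Latin-script text):

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Crude sentence splitting; classify each returned sentence separately.
public static List<String> sentences(String text) {
    BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
    it.setText(text);
    List<String> result = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
        result.add(text.substring(start, end).trim());
        start = end;
    }
    return result;
}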

An alternative to using an SBD is to use a sliding window of, say, 10 tokens (whitespace-delimited) to create a pseudo-sentence (PS) and try to identify the border where the language changes. This has the disadvantage that if your entire document has n tokens, you will perform approximately n-10 classification operations on strings of 10 tokens each. With the SBD approach, if the average sentence has 10 tokens, you would have performed approximately n/10 classification operations. For n = 1000 words in a document, that is 990 operations versus 100 operations: an order of magnitude difference.
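
A sketch of the sliding-window variant (the LanguageClassifier interface is a stand-in for any of the identifiers discussed above):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

interface LanguageClassifier {
    String identify(String text); // any identifier discussed above
}

// Classify overlapping 10-token pseudo-sentences; a change in the returned
// language between consecutive windows marks an approximate boundary.
public static List<String> windowLanguages(String document, LanguageClassifier clf) {
    String[] tokens = document.split("\\s+");
    List<String> languages = new ArrayList<>();
    for (int i = 0; i + 10 <= tokens.length; i++) {
        String window = String.join(" ", Arrays.copyOfRange(tokens, i, i + 10));
        languages.add(clf.identify(window));
    }
    return languages; // roughly n-10 entries, as noted above
}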


If you have short phrases (under 20 characters), accuracy of language detection is poor in my experience, particularly in the case of proper nouns and of nouns that are the same across languages, like "chocolate". For example, is "New York" an English word or a French word if it appears in a French sentence?

hashable
A: 

Bigram models perform well, are simple to write, simple to train, and require only a small amount of text for detection. The Nutch language identifier is a Java implementation we found and used with a thin wrapper.

We had problems with a bigram model for mixed CJK and English text (e.g. a tweet that is mostly Japanese but has a single English word). This is obvious in retrospect from looking at the math (Japanese has many more characters, so the probabilities of any given pair are low). I think you could solve this with some more complicated log-linear comparison, but I cheated and used a simple filter based on character sets that are unique to certain languages (i.e. if it only contains unified Han, then it's Chinese; if it contains some Japanese kana plus unified Han, then it's Japanese).
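
That kind of filter is easy to express with java.lang.Character.UnicodeScript (Java 7+); here is a sketch of the heuristic described above:

// Kana implies Japanese; Han without kana implies Chinese; otherwise
// fall back to the bigram model.
public static String scriptFilter(String text) {
    boolean hasHan = false;
    boolean hasKana = false;
    for (int i = 0; i < text.length(); ) {
        int cp = text.codePointAt(i);
        Character.UnicodeScript script = Character.UnicodeScript.of(cp);
        if (script == Character.UnicodeScript.HAN) {
            hasHan = true;
        } else if (script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA) {
            hasKana = true;
        }
        i += Character.charCount(cp);
    }
    if (hasKana) return "ja";
    if (hasHan) return "zh";
    return null;
}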

Kevin Peterson
A: 

Just read the file and get its encoding type; that will do.

Cong De Peng