More specifically, I'm trying to check whether a given string (a sentence) is in Turkish.

I can check whether the string contains Turkish characters such as Ç, Ş, Ü, Ö, Ğ, etc. However, that's not very reliable, as those may be converted to C, S, U, O, G before I receive the string.

Another method is to take the 100 most-used Turkish words and check whether the sentence includes any of them. I could combine these two methods and use a point system.
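Roughly what I have in mind (an untested sketch; the word set below is just a placeholder for the real top 100, and the weights are arbitrary):

TURKISH_CHARS = set('çğışüöÇĞİŞÜÖ')
COMMON_WORDS = {'bir', 've', 'bu', 'için', 'ne', 'gibi'}  # placeholder for the top 100

def turkish_score(sentence):
    score = 0
    # Special characters are strong evidence, but may have been stripped.
    if TURKISH_CHARS & set(sentence):
        score += 2
    # One point per common Turkish word found in the sentence.
    score += len(set(sentence.lower().split()) & COMMON_WORDS)
    return score

# Treat the sentence as Turkish if turkish_score(sentence) exceeds some threshold.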

What do you think is the most efficient way to solve my problem in Python?

Related question: (human) Language of a document (Perl, Google Translation API)

+12  A: 

One option would be to use a Bayesian classifier such as Reverend. The Reverend homepage gives this suggestion for a naive language detector:

from reverend.thomas import Bayes

guesser = Bayes()
# Train each language class on a handful of common words.
guesser.train('french', 'le la les du un une je il elle de en')
guesser.train('german', 'der die das ein eine')
guesser.train('spanish', 'el uno una las de la en')
guesser.train('english', 'the it she he they them are were to')
# guess() returns the candidate languages with probability estimates.
guesser.guess('they went to el cantina')
guesser.guess('they were flying planes')
# Training is incremental; you can keep feeding in more samples.
guesser.train('english', 'the rain in spain falls mainly on the plain')
guesser.save('my_guesser.bay')

Training with more complex token sets would strengthen the results.
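For the Turkish case specifically, a rough sketch of how you might extend this, assuming the same guesser object as above and a placeholder corpus file name:

# Hypothetical extension: train a 'turkish' class on a real corpus;
# 'turkish_corpus.txt' is a placeholder file name.
with open('turkish_corpus.txt') as f:
    guesser.train('turkish', f.read())
guesser.guess('bu bir deneme cümlesi')  # ideally now ranks 'turkish' highest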

Daniel
+8  A: 

A simple statistical method that I've used before:

Get a decent amount of sample training text in the language you want to detect. Split it up into trigrams, e.g.

"Hello foobar" in trigrams is: 'Hel', 'ell', 'llo', 'lo ', 'o f', ' fo', 'foo', 'oob', 'oba', 'bar'

For all of the source data, count up the frequency of occurrence of each trigram, presumably in a dict where key=trigram and value=frequency. You can limit this to the top 300 most frequent 3-letter combinations or something if you want. Pickle the dict away somewhere.

To tell if a new sample of text is written in the same language, repeat the above steps for the sample text. Now, all you have to do is compute a correlation between the sample trigram frequencies and the training trigram frequencies. You'll need to play with it a bit to pick a threshold correlation above which you are willing to consider the input Turkish.
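A minimal sketch of the above, using cosine similarity as the correlation measure (Cavnar & Trenkle actually use a rank-order "out-of-place" distance; the names and the threshold below are my own illustrative choices):

from collections import Counter
import math

def trigram_profile(text, top_n=300):
    # Count trigram frequencies, keeping only the top_n most common.
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return dict(counts.most_common(top_n))

def similarity(profile_a, profile_b):
    # Cosine similarity between two trigram-frequency dicts.
    dot = sum(freq * profile_b.get(tri, 0) for tri, freq in profile_a.items())
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Train once on sample text and pickle the profile away somewhere:
# import pickle
# with open('turkish_sample.txt') as f:
#     pickle.dump(trigram_profile(f.read()), open('turkish.pickle', 'wb'))
# Then, for a new sample:
# turkish = pickle.load(open('turkish.pickle', 'rb'))
# is_turkish = similarity(turkish, trigram_profile(sample)) > 0.3  # tune the threshold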

This method has been shown to be highly accurate, beating out more sophisticated methods; see

Cavnar & Trenkle (1994): "N-Gram-Based Text Categorization"

Using trigrams solves the problem of relying on word lists, since any given language has a vast number of words, especially across different grammatical permutations. I've tried looking for common words; the problem is that they often give a false positive for some other language, or themselves have many permutations. The statistical method doesn't require a lot of storage space and doesn't require complex parsing. By the way, this method only works for languages with a phonetic writing system; it works poorly, if at all, with languages that use an ideographic writing system (e.g. Chinese, Japanese, Korean).

Alternatively, Wikipedia has a section on Turkish in its handy language recognition chart.

ʞɔıu
A: 

Why not just use an existing spell-checking library? Run the spell check against several languages and choose the language with the lowest error count.
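A rough sketch of that idea using the pyenchant library (assuming the relevant dictionaries, including a Turkish one, are installed; the language codes and names below are illustrative):

import enchant

def guess_language(text, languages=('tr_TR', 'en_US', 'de_DE')):
    # Return the language whose dictionary rejects the fewest words.
    words = [w for w in text.split() if w.isalpha()]
    def errors(lang):
        d = enchant.Dict(lang)
        return sum(1 for w in words if not d.check(w))
    return min(languages, key=errors)

# guess_language('bu bir deneme metni')  # -> 'tr_TR', dictionaries permitting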

Kim