Is there any C# library which can detect the language of a particular piece of text? For example, for the input text "This is a sentence", it should detect the language as "English", and for "Esto es una sentencia" it should detect the language as "Spanish".

I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which best-guesses the input language. Is there something similar available publicly, preferably in C#?

A: 

You'll want a machine learning algorithm based on hidden Markov chains: train it on a bunch of texts in different languages.

Then, when it is given an unidentified text, the language whose model gives the closest 'score' is the winner.
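
A minimal sketch of that scoring idea, in Python for brevity (the logic ports directly to C#). Note that it uses a plain character-level Markov chain rather than a true hidden Markov model, and the two tiny training strings, the function names and the `alphabet_size` smoothing parameter are all assumptions made for illustration. Each language gets a table of character-to-character transition counts, and the language whose table gives the text the highest log-likelihood wins.

```python
import math
from collections import Counter, defaultdict

def train(text):
    """Count how often each character follows each other character."""
    text = text.lower()
    transitions = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        transitions[prev][cur] += 1
    return transitions

def log_likelihood(model, text, alphabet_size=256):
    """Sum of log P(cur | prev), with add-one smoothing for unseen transitions.

    alphabet_size is an assumed smoothing constant, not a tuned value.
    """
    text = text.lower()
    total = 0.0
    for prev, cur in zip(text, text[1:]):
        row = model.get(prev, {})
        row_total = sum(row.values())
        total += math.log((row.get(cur, 0) + 1) / (row_total + alphabet_size))
    return total

def detect(models, text):
    """Return the language whose model scores the text highest."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))

# Toy training data; a real detector needs far larger corpora per language.
SAMPLES = {
    "English": "this is a sentence and the quick brown fox jumps over the lazy dog "
               "while the children are playing in the garden with their friends",
    "Spanish": "esto es una frase y el perro corre por la calle mientras los niños "
               "juegan en el jardín con sus amigos y la casa es grande",
}
MODELS = {lang: train(text) for lang, text in SAMPLES.items()}

print(detect(MODELS, "This is a sentence"))
print(detect(MODELS, "Esto es una sentencia"))
```

With corpora this small the scores are only indicative; the quality of the result depends almost entirely on how much training text each language gets.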

Arafangion
+2  A: 

Here is a simple detector based on bigram statistics. The idea is to learn, from a large training set, which bigrams occur most frequently in each language, then count the bigrams in a piece of text and compare them against those learned frequencies:

http://allantech.blogspot.com/2007/07/automatic-language-detection.html

This is probably good enough for many (most?) applications and doesn't require Internet access.

Of course it will perform worse than Google's or Bing's algorithms (which aren't great themselves). If you need excellent detection performance, you will have to put in a lot of hard work and train on huge amounts of data.

The other option would be to leverage Google's or Bing APIs if your app has Internet access.
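
To make the bigram idea concrete, here is a rough sketch in Python (easy to port to C#). It is not the code from the linked post: the training strings, the function names and the choice of cosine similarity as the comparison are my own assumptions. It builds a relative-frequency table of bigrams per language and picks the language whose table is most similar to the table of the input text.

```python
import math
from collections import Counter

def bigram_profile(text):
    """Relative frequency of each two-character sequence (letters and spaces only)."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    counts = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(counts.values()) or 1
    return {bigram: n / total for bigram, n in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency tables."""
    dot = sum(value * q.get(bigram, 0.0) for bigram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def detect_language(text, profiles):
    """Return the language whose bigram table is closest to the text's table."""
    sample = bigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))

# Toy training data; real profiles should come from a big set of texts.
TRAINING = {
    "English": "this is a sentence and the quick brown fox jumps over the lazy dog "
               "while the children are playing in the garden with their friends",
    "Spanish": "esto es una frase y el perro corre por la calle mientras los niños "
               "juegan en el jardín con sus amigos y la casa es grande",
}
PROFILES = {lang: bigram_profile(text) for lang, text in TRAINING.items()}

print(detect_language("This is a sentence", PROFILES))
print(detect_language("Esto es una sentencia", PROFILES))
```

Since the profiles are just small dictionaries, they can be computed once, stored with the application and used offline.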

Vinko Vrsalovic
In fact, this approach will give quite good results. It can be improved by using n-grams instead of bi-grams. However, it will always be difficult to tell very similar languages (e.g. Polish and Czech) apart. Languages such as Greek will be very easy though...
0xA3
To avoid misunderstandings, what would you call quite good in this context?
Vinko Vrsalovic
+2  A: 
dreamlax
You should consider a more generic n-gram classifier trained on a corpus.
Luca Martinetti
A: 

You can utilize Google's translation webservice to do this.

leppie
A: 

There is a simple tool to identify text language: http://www.detectlanguage.com/

Laurynas
A: 

I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on this original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work -- the letter combinations that are relevant to a particular language -- is all in there as data.

Matt Gibson