What's the best way to detect the (natural) language of a given string? Is there an encoding trick or something similar?

Thanks

+6  A: 

Do a statistical analysis of the string: split the string into words, get a dictionary (word list) for every language you want to test for, and count how many of the words appear in each dictionary. The language with the highest word count is your best guess.
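A minimal sketch of that word-count idea, in JavaScript. The word lists here are tiny, hypothetical stand-ins chosen only for illustration; a usable detector would need real dictionaries with thousands of entries per language, and the function name detectByWordCount is made up for this example.

```javascript
// Tiny illustrative word lists (assumption: a real detector needs far larger ones).
const WORD_LISTS = {
  english: new Set(["the", "and", "is", "of", "to", "in", "that", "it", "where", "bathroom"]),
  dutch: new Set(["de", "het", "en", "een", "van", "is", "dat", "niet", "waar", "ik"]),
  spanish: new Set(["el", "la", "de", "que", "y", "en", "es", "está", "dónde", "los"]),
};

function detectByWordCount(text) {
  // Lowercase, normalise accents to composed form, split on anything that is not a letter.
  const words = text
    .toLowerCase()
    .normalize("NFC")
    .split(/[^\p{L}]+/u)
    .filter(Boolean);

  let best = "unknown";
  let bestScore = 0;
  for (const [language, list] of Object.entries(WORD_LISTS)) {
    // Score = number of words from the text found in this language's list.
    const score = words.filter((w) => list.has(w)).length;
    if (score > bestScore) {
      best = language;
      bestScore = score;
    }
  }
  return best; // "unknown" when no word matched any list
}

console.log(detectByWordCount("¿Dónde está el baño?"));
console.log(detectByWordCount("Ik weet niet waar het is"));
```

Words shared between languages (such as "is" in English and Dutch) count for both, so short inputs can easily tie or mislead; that is the weakness the other answers point out.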

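An encoding trick will not help. In C#, every string in memory is Unicode (UTF-16), so it carries no trace of any original encoding. Text files do not store their encoding either (sometimes there is only an indication, such as a byte-order mark, of whether the text is 8-bit or 16-bit).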

If you only need to distinguish between two languages, you might find some simple tricks. For example, to tell English from Dutch: a string that contains a "y" is more likely to be English. (Unreliable, but fast.)

GvS
Are you saying there's no "y" in Dutch? I can give you 100 Dutch words with a "y" straight away.
Philippe Leybaert
This might be suitable for a beginning programming class, but is far from a real solution to the problem.
280Z28
But there is no 100% reliable language detection. If you want a fast but unreliable distinction between Dutch and English, counting the y's will perform very nicely (that's what the "mostly" means).
GvS
+4  A: 

If you mean the natural (i.e. human) language, this is in general a Hard Problem. What language is "server": English or Turkish? What language is "chat": English or French? What language is "uno": Italian or Spanish (or Latin!)?

Without paying attention to context, and doing some hard natural language processing (<----- this is the phrase to google for) you haven't got a chance.

You might enjoy a look at Frengly - it's a nice UI onto the Google Translate service which attempts to guess the language of the input text...

AakashM
+5  A: 

A statistical approach using digraphs or trigraphs is a very good indicator. For example, here are the most common digraphs in English in order: http://www.letterfrequency.org/#digraph-frequency (one can find better or more complete lists). This method may have a better success rate than word analysis for short snippets of text because there are more digraphs in text than there are complete words.
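A rough sketch of the n-gram idea, in JavaScript, using trigraphs. It builds a frequency profile from a sample text per language and picks the profile most similar to the input. The sample sentences, the choice of cosine similarity as the comparison, and all function names are assumptions made for illustration; real profiles would be built from large corpora or published frequency tables.

```javascript
function ngramCounts(text, n = 3) {
  // Lowercase, collapse every run of non-letters into one space, pad with spaces
  // so that word boundaries show up in the n-grams.
  const clean =
    " " + text.toLowerCase().normalize("NFC").replace(/[^\p{L}]+/gu, " ").trim() + " ";
  const counts = new Map();
  for (let i = 0; i + n <= clean.length; i++) {
    const gram = clean.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

function cosineSimilarity(a, b) {
  // Treat the two count maps as sparse vectors and compare their direction.
  let dot = 0, normA = 0, normB = 0;
  for (const [gram, count] of a) {
    dot += count * (b.get(gram) || 0);
    normA += count * count;
  }
  for (const count of b.values()) normB += count * count;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Hypothetical training samples; far too small for real use.
const SAMPLES = {
  english: "the quick brown fox jumps over the lazy dog and then the dog runs with the other animals in the field",
  french: "le renard brun saute par dessus le chien paresseux et ensuite le chien court avec les autres animaux dans le champ",
};
const PROFILES = Object.fromEntries(
  Object.entries(SAMPLES).map(([lang, sample]) => [lang, ngramCounts(sample)])
);

function detectByNgrams(text) {
  const target = ngramCounts(text);
  let best = "unknown", bestScore = 0;
  for (const [lang, profile] of Object.entries(PROFILES)) {
    const score = cosineSimilarity(target, profile);
    if (score > bestScore) { best = lang; bestScore = score; }
  }
  return best;
}

console.log(detectByNgrams("the other dog"));
console.log(detectByNgrams("le chien et les animaux"));
```

Because even a three-word input yields a dozen or so trigraphs, there is more evidence to score than with whole-word matching, which is the advantage claimed above for short snippets.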

Greg Hewgill
+1 for the link
Chris
+18  A: 

If your code runs somewhere with internet access, you can try the Google API for language detection: http://code.google.com/apis/ajaxlanguage/documentation/

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    // Map the returned language code back to its readable name.
    var language = 'unknown';
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    // Show the result in the page element with id "detection".
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language;
  }
});

And, since you are using C#, take a look at this article on how to call the API from C#.

Magnus Johansson