Which is the best Java library for automatic language identification/classification?

Hypothetical syntax:

String languageCode = LanguageIdentificationAPI.identifyLanguage("Hello world.");
// languageCode would now contain "en" for English.

Thanks a lot in advance!

+1  A: 

The obvious solution, based on n-gram statistics and identification of common words, has been patented, so let the coder beware!
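For readers unfamiliar with the n-gram approach mentioned here, a minimal sketch might look like the following. It assumes character trigrams and a rank-order ("out-of-place") comparison in the style described by Cavnar and Trenkle. The class name `NGramLanguageGuesser`, the methods `train` and `identifyLanguage`, and the constants are all hypothetical names chosen for illustration, not part of any existing library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class: guesses a language by comparing
// ranked character-trigram profiles. Not a real library API.
class NGramLanguageGuesser {
    private static final int N = 3;              // n-gram length (assumption)
    private static final int PROFILE_SIZE = 300; // max n-grams kept per profile (assumption)

    // Language code -> n-grams ordered from most to least frequent.
    private final Map<String, List<String>> profiles = new LinkedHashMap<>();

    // Builds a ranked trigram profile: most frequent n-gram first, ties broken alphabetically.
    static List<String> profile(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        String padded = " " + cleaned + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + N), 1, Integer::sum);
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(PROFILE_SIZE, grams.size())));
    }

    // Out-of-place distance: sum of rank differences, with a fixed
    // penalty for n-grams that the language profile does not contain.
    static int distance(List<String> doc, List<String> lang) {
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = lang.indexOf(doc.get(i));
            total += (j < 0) ? PROFILE_SIZE : Math.abs(i - j);
        }
        return total;
    }

    // Registers a language profile built from sample text in that language.
    void train(String languageCode, String sampleText) {
        profiles.put(languageCode, profile(sampleText));
    }

    // Returns the code of the trained language whose profile is closest to the text.
    String identifyLanguage(String text) {
        if (profiles.isEmpty()) {
            throw new IllegalStateException("no language profiles trained");
        }
        List<String> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Usage would be along the lines of calling `train("en", ...)` and `train("de", ...)` with a sample text per language, then `identifyLanguage("the dog sleeps in the house")`. A real system would need far larger training samples than a sentence or two per language, and short inputs will be unreliable.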

joel.neely
I thought you couldn't patent algorithms. Hmm...
biozinc
It's only valid in the USA, so who really cares? :P Besides, patents didn't stop people from using the Marching Cubes algorithm all over the place.
Esko
+3  A: 

http://textcat.sourceforge.net/ ?

tuinstoel
A: 

Or use Google's public language API, if remote access is an option:

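// Imports the snippet needs. JSONObject and JSONValue are assumed to come from the json-simple library.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;

public class DetectLanguage { // wrapper class name is illustrative
    public static void main(String[] args) {
        try {
            // URL-encode the text to classify
            String s = URLEncoder.encode("Há tantos burros mandando em homens de inteligência, que, às vezes, fico pensando que a burrice é uma Ciência", "UTF-8");
            URL url = new URL("http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=" + s);
            // Read the whole JSON response into a buffer
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
            StringBuilder buffer = new StringBuilder();
            String str;
            while ((str = in.readLine()) != null) {
                buffer.append(str);
            }
            in.close();
            // Pull the detected language and confidence out of "responseData"
            JSONObject obj = (JSONObject) ((JSONObject) JSONValue.parse(buffer.toString())).get("responseData");
            System.out.println(obj.get("language"));
            System.out.println(obj.get("confidence"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}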
andi