I am dealing with an application that accepts user input in different languages (currently a fixed set of three). The requirement is that users can simply enter text without having to select the language via a checkbox in the UI.

Is there an existing Java library to detect the language of a text?

I want something like this:

text = "To be or not to be, that's the question.";

// returns ISO 639 Alpha-2 code
language = detect(text);

print(language);

result:

EN

I don't want to know how to create a language detector myself (I have seen plenty of blogs trying to do that). The library should provide a simple API and work completely offline. Whether it is open source or closed-source commercial doesn't matter.

I also found these questions on SO (and a few more):

How to detect language
How to detect language of text?

+1  A: 

Google offers an API that can do this for you. I just stumbled across this yesterday and didn't keep a link, but if you, umm, Google for it you should manage to find it.

This was somewhere near the description of their translation API, which will translate text for you into any language you like. There's another call just for guessing the input language.

Google is among the world's leaders in machine translation; they base their stuff on extremely large corpora of text (most of the Internet, kinda) and a statistical approach that usually "gets" it right simply by virtue of having a huge sample space.
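To give a feel for the statistical idea at toy scale, here is a minimal Java sketch. It is not how Google or any real library works: the class name `ToyLanguageGuesser`, the `detect` method, and the tiny hand-picked stop-word lists are all assumptions made up for illustration, standing in for profiles trained on a large corpus. It simply counts how many words of the input appear in each language's list and returns the code with the most hits.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy detector: counts how many tokens match each language's stop-word list. */
final class ToyLanguageGuesser {

    // Hypothetical, hand-picked stop words; a real library would use large trained profiles.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", Set.of("the", "to", "be", "or", "not", "and", "is", "of"));
        STOP_WORDS.put("de", Set.of("der", "die", "das", "und", "ist", "nicht", "oder", "sein"));
        STOP_WORDS.put("fr", Set.of("le", "la", "les", "et", "est", "ne", "pas", "ou"));
    }

    /** Returns the two-letter code with the most stop-word hits, or "unknown" if nothing matches. */
    static String detect(String text) {
        // Lowercase, then split on anything that is not a letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int score = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    score++;
                }
            }
            // Keep the first language that reaches the highest score.
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // prints "en"
        System.out.println(detect("To be or not to be, that's the question."));
    }
}
```

With lists this small, short or unusual inputs will easily be misclassified, which is exactly why the size of the sample space matters so much.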

EDIT: Here's the link: http://code.google.com/apis/ajaxlanguage/

EDIT 2: If you insist on "offline": a well-upvoted answer suggested Guess-Language. It's a C++ library and handles about 60 languages.

Carl Smotricz
this? -> http://code.google.com/apis/ajaxlanguage/documentation/#Detect
potatopeelings
does it work offline?
ManBugra
@ManBugra: Only if you have a backup copy of Google's data storage facility handy ;)
Carl Smotricz
@potatopeelings: I just dove into Google and found the very same thing. I'll update my answer. Thanks!
Carl Smotricz
@potatopeelings: I have to admit English isn't my mother tongue, but when I say "offline" I mean it should work without any network connection (i.e., no internet).
ManBugra
Guess-Language is Python, so Jython should be able to run it from Java.
tulskiy
@tulskiy Oops, I was misled by the sentence "based on guesslanguage.cpp". Yep, good hint. Python and Java are not a match made in heaven, but they can be duct-taped together as you say.
Carl Smotricz
@ManBugra - sorry, my fault. I probably read "offline" but it didn't register. I think your usage is right, btw.
potatopeelings
+1  A: 

An alternative is JLangDetect, but it's not very robust and supports a limited set of languages. The good thing is that it's under an Apache license, so if it satisfies your requirements, you can use it. Apparently, version 0.2 has been released here.

EDIT: You can also check the question what-is-a-good-tool-for-natural-language-detection-in-java

Manny
A: 

Here are two options

Jay Askren