I need to do natural-language detection (with confidence scores), preferably in Java; I'd rather not introduce more platforms or technologies at this stage of the project. I have previously used the Google API for this in a PoC, but I now need to scale up to very large amounts of data, so a web-based solution won't cut it. (Google is also unclear in its TOS.)

+3  A: 

While it doesn't provide confidence scores, you could at least start from my cue.language library, and modify its language-detection stuff to return all of the languages it found, with a simple confidence score. (It would be an easy modification.)

Jonathan Feinberg
Nice! This is stop-word based, yes?
johanbev
Yes, exactly. It's very dumb, but "good enough" for my purposes!
Jonathan Feinberg
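To make the stop-word idea discussed above concrete, here is a minimal sketch of a detector that returns every candidate language with a simple confidence score. It is not cue.language's actual API: the class name `StopWordGuesser`, the `score` method, and the tiny word lists are all hypothetical and chosen only for illustration. The score is simply each language's share of all stop-word hits.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; this is NOT the cue.language API.
final class StopWordGuesser {
    // Tiny illustrative stop-word lists (assumption); a real detector needs far larger ones.
    static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", Set.of("the", "and", "is", "of", "to", "in", "it"));
        STOP_WORDS.put("de", Set.of("der", "die", "und", "ist", "das", "nicht", "ein"));
        STOP_WORDS.put("fr", Set.of("le", "la", "et", "est", "les", "des", "un"));
    }

    /** Returns, per language, its share of all stop-word hits in the text (0.0 to 1.0). */
    static Map<String, Double> score(String text) {
        // Split on anything that is not a letter, after lower-casing.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        Map<String, Integer> hits = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, Set<String>> lang : STOP_WORDS.entrySet()) {
            int n = 0;
            for (String token : tokens) {
                if (lang.getValue().contains(token)) {
                    n++;
                }
            }
            hits.put(lang.getKey(), n);
            total += n;
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            // With no hits at all, every language gets 0.0 rather than dividing by zero.
            scores.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return scores;
    }

    public static void main(String[] args) {
        System.out.println(score("The cat is in the garden and it is happy"));
    }
}
```

Because the score is a share of hits, it says nothing about how much evidence there was; a text with a single stop-word hit still scores 1.0 for that language, so a real implementation would probably also want a minimum hit count.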
A: 

We sell one, with confidence scores. www.basistech.com. If you are an academic we can probably let you use it for free. If you need open source, of course, there are other answers being posted.

bmargulies
hey, he didn't say 'free'.
bmargulies
+3  A: 

You need to determine what language a text is in? I've used some n-gram-based algorithms for this. NGramJ is a Java implementation (one of several; I have no opinion on which Java implementation is best).

Alex Brasetvik
Just to clarify, this uses character n-grams, which makes decisions based on the frequencies with which letters appear next to each other. I've heard that this kind of technique should work even on text as short as a search query of a few words.
Ken Bloom
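The character n-gram idea described above can be sketched in a few lines. This is not NGramJ's API; `TrigramSketch`, `trigrams`, and `cosine` are hypothetical names, the sample sentences are made up, and cosine similarity is just one simple way to compare two frequency profiles. A real system would build its profiles from large training corpora.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of character trigram profiles; NOT the NGramJ API.
final class TrigramSketch {
    /** Counts overlapping character trigrams, padding with spaces so word boundaries count. */
    static Map<String, Integer> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of two count vectors: 0.0 (nothing shared) up to 1.0 (same direction). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Toy "training" texts (assumption); far too small for real use.
        Map<String, Integer> en = trigrams("the quick brown fox jumps over the lazy dog and then the dog sleeps");
        Map<String, Integer> de = trigrams("der schnelle braune fuchs springt und der hund schlief dann lange im garten");
        Map<String, Integer> query = trigrams("where is the dog");
        System.out.println("en: " + cosine(query, en));
        System.out.println("de: " + cosine(query, de));
    }
}
```

The similarity values themselves can serve as rough per-language scores, which is why n-gram approaches tend to yield a confidence-like number more naturally than a yes/no stop-word match.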
A: 

Try TextCat. It is free, in Java, and based on an N-Gram algorithm. You can poke a little and get a confidence score, I think. If the texts are long enough, I believe most of the suggestions here will give decent performance.

Yuval F
TextCat probably does word n-grams, which is unlikely to be what you want. Character n-grams are probably more appropriate.
Ken Bloom
No. Read the paper they refer to. TextCat does use character n-grams, which are indeed appropriate.
Yuval F
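As for getting a confidence score out of a classifier that only reports a per-language distance, one simple option is to compare the best match with the runner-up. The sketch below assumes you can obtain a lower-is-better distance for each candidate language; it is not TextCat's API, and `MarginConfidence` and `confidence` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes lower-is-better distances. NOT part of TextCat.
final class MarginConfidence {
    /** Returns 1 - best/secondBest: 0.0 means a tie, values near 1.0 mean a clear winner. */
    static double confidence(Map<String, Double> distances) {
        double best = Double.MAX_VALUE;
        double second = Double.MAX_VALUE;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        // With fewer than two candidates (or a zero runner-up) there is no margin to measure.
        if (distances.size() < 2 || second <= 0) {
            return 0.0;
        }
        return 1.0 - best / second;
    }

    public static void main(String[] args) {
        Map<String, Double> distances = new LinkedHashMap<>();
        distances.put("en", 100.0);
        distances.put("de", 400.0);
        distances.put("fr", 500.0);
        System.out.println(confidence(distances)); // prints 0.75
    }
}
```

A margin like this is only a heuristic, not a probability; it mainly helps flag inputs where two languages are nearly tied.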
+2  A: 

You can try LingPipe. There's a tutorial for language id.

anno
+2  A: 

Take a look at Nutch's Language Identifier. I used it a couple of years back with excellent results.

Shashikant Kore
Ended up using this, good tool, and pretrained too! Integration took no more than 30 min. THX mate!
johanbev