What is a good tool for Natural Language Detection in Java?

views:

587

answers:

+4 Q:

I need to do natural language detection (with confidence scores), preferably in Java; I'd really rather not introduce more platforms/technologies at this stage of the project. I have previously used the Google API for this in a PoC, but I now need to scale up to very large amounts of data, so a web-based solution won't cut it (also, Google's TOS are unclear on this).

+3 A:

While it doesn't provide confidence scores, you could at least start from my cue.language library, and modify its language-detection stuff to return all of the languages it found, with a simple confidence score. (It would be an easy modification.)

Jonathan Feinberg 2009-12-17 19:02:31

Nice! This is stop-word based, yes?

johanbev 2009-12-17 19:07:31

Yes, exactly. It's very dumb, but "good enough" for my purposes!

Jonathan Feinberg 2009-12-17 19:07:57
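
For readers wondering what a stop-word-based detector with a simple confidence score might look like, here is a minimal sketch. It is not the cue.language code: the class name, the method name, and the tiny word lists are all made up for illustration, and the "confidence" is just each language's share of the stop-word hits.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, NOT the cue.language implementation.
final class StopWordLanguageGuesser {
    // Tiny illustrative stop-word lists (assumption); a real detector needs far larger ones.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", Set.of("the", "and", "is", "of", "to", "in", "that", "it"));
        STOP_WORDS.put("de", Set.of("der", "die", "das", "und", "ist", "nicht", "ein", "zu"));
        STOP_WORDS.put("fr", Set.of("le", "la", "les", "et", "est", "un", "une", "des"));
    }

    /** Returns language code -> share of all stop-word hits (0.0 to 1.0). */
    static Map<String, Double> scores(String text) {
        // Split on anything that is not a letter, so punctuation and digits are ignored.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (String lang : STOP_WORDS.keySet()) {
            hits.put(lang, 0);
        }
        int total = 0;
        for (String token : tokens) {
            for (Map.Entry<String, Set<String>> e : STOP_WORDS.entrySet()) {
                if (e.getValue().contains(token)) {
                    hits.merge(e.getKey(), 1, Integer::sum);
                    total++;
                }
            }
        }
        // Normalise the hit counts; with no hits at all, every language scores 0.0.
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            result.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(scores("Der Hund ist nicht in dem Haus und die Katze ist zu Hause"));
    }
}
```

Returning the whole map rather than only the top language is what lets the caller see how close the runner-up was, which is roughly the "easy modification" described above.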

We sell one, with confidence scores: www.basistech.com. If you are an academic, we can probably let you use it for free. If you need open source, of course, there are other answers being posted.

bmargulies 2009-12-17 19:03:25

hey, he didn't say 'free'.

bmargulies 2009-12-17 19:05:29

+3 A:

You need to determine what language a text is in? I've used some n-gram-based algorithms for this. NGramJ is a Java implementation. (It is one of several; I have no opinion on which Java implementation is the best.)

Alex Brasetvik 2009-12-17 19:03:33

Just to clarify, this uses character n-grams: it makes decisions based on the frequencies with which letters appear next to each other. I've heard that this kind of technique should be able to work on text as short as a search query of a few words.

Ken Bloom 2009-12-21 00:41:55
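
To make the character n-gram idea concrete, here is a rough sketch of the rank-based approach usually attributed to Cavnar and Trenkle: build a ranked n-gram profile per language from training text, build one for the document, and pick the language whose profile is closest. This is not NGramJ's API; the class and method names are hypothetical, and the choice of n = 1..3 and the penalty for missing n-grams are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of rank-based character n-gram matching, not a library API.
final class CharNGramSketch {
    /** Ranks the character n-grams (n = 1..3) of a text, most frequent first, keeping at most limit. */
    static List<String> profile(String text, int limit) {
        // Replace runs of non-letters with '_' so that word boundaries become part of the n-grams.
        String padded = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Descending count; ties broken alphabetically so the ranking is deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }

    /** Out-of-place distance: sum of rank differences; a missing n-gram costs language.size(). */
    static int distance(List<String> document, List<String> language) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < language.size(); i++) {
            rank.put(language.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer r = rank.get(document.get(i));
            total += (r == null) ? language.size() : Math.abs(r - i);
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy "training" text (assumption); real profiles are built from much larger corpora.
        List<String> english = profile("the quick brown fox jumps over the lazy dog and then the cat", 300);
        List<String> query = profile("the dog and the fox", 300);
        System.out.println(distance(query, english));
    }
}
```

A lower distance means a better match. Comparing the best distance against the second best is one simple way to derive a confidence score from this.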

Try TextCat. It is free, in Java, and based on an N-Gram algorithm. You can poke a little and get a confidence score, I think. If the texts are long enough, I believe most of the suggestions here will give decent performance.

Yuval F 2009-12-17 19:52:11

TextCat probably does word n-grams, which is unlikely to be what you want. Character n-grams are probably more appropriate.

Ken Bloom 2009-12-21 00:40:09

No. Read the paper they refer to. TextCat does use character n-grams, which are indeed appropriate.

Yuval F 2009-12-21 06:56:21

+2 A:

You can try LingPipe. There's a tutorial for language id.

anno 2009-12-17 21:58:08

+2 A:

Take a look at Nutch's Language Identifier. I used it a couple of years back with excellent results.

Shashikant Kore 2009-12-18 10:26:09

Ended up using this, good tool, and pretrained too! Integration took no more than 30 min. THX mate!

johanbev 2009-12-20 15:56:10

Try this one! http://developer.cybozu.co.jp/oss/2010/10/language-detect.html

tnnrg 2010-10-15 04:45:25
