views:

966

answers:

7

Hello everyone, I want to know what the best open-source, Java-based framework for text mining is, one that supports both machine learning and dictionary methods.

I'm using Mallet, but there is not much documentation and I do not know whether it will fit all my requirements.

Thanks in advance. Best Regards,

ukrania

+2  A: 

Although not a specialized text mining framework, Weka has a number of classifiers usually employed in text mining tasks such as: SVM, kNN, multinomial NaiveBayes, among others.

It also has a few filters to work with textual data, such as the StringToWordVector filter, which can perform a TF/IDF transformation.
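To make the TF/IDF idea concrete, here is a minimal plain-Java sketch of the weighting itself. It is not Weka's implementation and does not use the Weka API: the class name `TfIdfSketch`, the naive tokenizer, and the specific formula (raw term count times ln(N / document frequency)) are assumptions chosen for illustration, and the filter's own options may weight terms differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Weka.
class TfIdfSketch {
    /** Lowercases and splits on non-letter characters; a deliberately naive tokenizer. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Returns one term -> weight map per document, with weight = tf * ln(N / df). */
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Term frequency: how often each term occurs in this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String tok : tokenize(doc)) tf.merge(tok, 1, Integer::sum);
            counts.add(tf);
            // Document frequency: in how many documents each term occurs.
            for (String term : tf.keySet()) docFreq.merge(term, 1, Integer::sum);
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                weights.put(e.getKey(), e.getValue() * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog barked", "the cat purred");
        // Terms that occur in every document (here "the") get weight 0.
        System.out.println(tfIdf(docs));
    }
}
```

A term that appears in every document ends up with weight zero, which is the point of the IDF factor: it down-weights words that do not help distinguish documents.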

Check out the Weka wiki website for more information.

Amro
The problem is that I need to perform Named Entity Recognition (NER), and Weka does not provide a way to extract features from words, such as orthographic and morphological characteristics. But it would be cool if I could use Weka's methods for IR.
ukrania
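For readers unfamiliar with the orthographic and morphological word features mentioned in the comment above, the following plain-Java sketch shows the kind of per-token features an NER system typically feeds to a classifier. The class `TokenFeaturesSketch` and the feature names are hypothetical, chosen only for illustration; they do not come from Weka, Mallet, or any other library named in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; feature names are assumptions.
class TokenFeaturesSketch {
    /** Maps uppercase letters to X, lowercase to x, digits to d; other characters are kept. */
    static String wordShape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (Character.isUpperCase(c)) sb.append('X');
            else if (Character.isLowerCase(c)) sb.append('x');
            else if (Character.isDigit(c)) sb.append('d');
            else sb.append(c);
        }
        return sb.toString();
    }

    /** Extracts a few orthographic (shape, case, digits) and morphological (affix) features. */
    static Map<String, String> features(String token) {
        Map<String, String> f = new LinkedHashMap<>();
        f.put("shape", wordShape(token));
        f.put("initCap", String.valueOf(!token.isEmpty() && Character.isUpperCase(token.charAt(0))));
        // All caps: unchanged by upper-casing, but contains at least one cased letter.
        boolean allCaps = token.equals(token.toUpperCase(Locale.ROOT))
                && !token.equals(token.toLowerCase(Locale.ROOT));
        f.put("allCaps", String.valueOf(allCaps));
        f.put("hasDigit", String.valueOf(token.chars().anyMatch(Character::isDigit)));
        f.put("hasHyphen", String.valueOf(token.indexOf('-') >= 0));
        // Crude morphological cues: up to three leading and trailing characters.
        f.put("prefix3", token.substring(0, Math.min(3, token.length())));
        f.put("suffix3", token.substring(Math.max(0, token.length() - 3)));
        return f;
    }

    public static void main(String[] args) {
        System.out.println(features("IL-2"));
        System.out.println(features("Kinase"));
    }
}
```

Each feature map would still need to be converted into whatever input format the chosen ML toolkit expects.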
I think the Wikipedia page on the topic has a few links to some packages for NER. Also, I just came across the UIMA project by Apache; perhaps you'll find it useful: http://incubator.apache.org/uima/index.html
Amro
Yes, I know UIMA, but it does not provide ML methods. It is a perfect solution for systems that do NER with dictionary-based approaches. I don't know how to integrate ML methods into UIMA.
ukrania
A: 

Maybe have a look at Java Open Source NLP and Text Mining tools.

Pascal Thivent
I've already seen this website; it is really nice, thanks. But I was asking for feedback from your own experience. I've already tried some of them, but I don't know which one is the best, or even whether I have to use one, two, or maybe more frameworks to accomplish my task.
ukrania
@ukrania Sorry, I'm not the right person then. Good luck.
Pascal Thivent
A: 

We use Lucene to process live streams from the internet. It has a native Java API.

http://lucene.apache.org/java/docs/

You can then use Mahout, which is a bunch of machine learning algorithms that operate on top of Lucene.

http://lucene.apache.org/mahout/

steve
Is it possible to use Mahout to perform NER?
ukrania
+1  A: 

I've used LingPipe -- a suite of Java libraries for the linguistic analysis of human language -- for text mining (and other related) tasks.

It is a very well documented software package, and the site contains several tutorials which thoroughly explain how to do a certain task with LingPipe, such as named entity recognition. There is also a newsgroup, wherein you can post any question you have about the software (or NLP related tasks), and have a prompt reply from the authors of the package themselves; and of course, a blog.

The source code is also very easy to follow and well documented which, for me, is always a big plus.

As for machine learning algorithms, there are plenty, from Naïve Bayes to Conditional Random Fields. On the other hand, for dictionary-matching algorithms, they have an ExactDictionaryChunker, which is an implementation of the Aho-Corasick algorithm (a very, very fast algorithm for this task).
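To show why this kind of dictionary matching is fast, here is a compact plain-Java sketch of the general Aho-Corasick technique: build a trie of the dictionary, add failure links, then scan the text once. It is not LingPipe's code and does not use its API; the class `AhoCorasickSketch`, the `findAll` method, and the `entry@offset` hit format are assumptions made for illustration. It matches raw characters and ignores tokenization and word boundaries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration class; not LingPipe's implementation.
class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // longest proper suffix that is also a trie path
        final List<String> outputs = new ArrayList<>();  // dictionary entries ending at this node
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> dictionary) {
        // 1. Build a trie of all (non-empty) dictionary entries.
        for (String word : dictionary) {
            if (word.isEmpty()) continue;
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(word);
        }
        // 2. Breadth-first pass: set failure links and inherit outputs from them.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> e : current.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and returns hits formatted as "entry@startOffset". */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some node has an edge for c (or we reach the root).
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node nextNode = node.next.get(c);
            node = (nextNode != null) ? nextNode : root;
            for (String word : node.outputs) {
                hits.add(word + "@" + (i - word.length() + 1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac = new AhoCorasickSketch(Arrays.asList("he", "she", "his", "hers"));
        System.out.println(ac.findAll("ushers")); // prints [she@1, he@2, hers@2]
    }
}
```

The scan never backs up in the text, so matching cost grows with the text length plus the number of hits rather than with the size of the dictionary.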

In sum, I think it is one of the best NLP software packages for Java (I haven't used every single package that is out there, so I can't say it's the best), and I definitely recommend it for the task that you have at hand.

JG
@JG Thanks for your advice :). I'm building my system for research. Do I have to pay anything if I make a commercial tool? What are the limitations?
ukrania
A: 

You may already know about GATE: http://gate.ac.uk/

...but that's what we've used (at my day job) for lots of different text mining problems. It's pretty flexible and open.

PSpeed
@PSpeed Yes I already know it. GATE is very similar to UIMA. Actually, GATE was the first one to emerge. However, I don't know if it is possible to perform ML methods with GATE. Do you know something about that?
ukrania
I think GATE is more flexible too... we found UIMA to be very confining. I don't have specific experience with ML but it just seemed like if someone was working on it then GATE would be a likely platform. It's where I might start if I were writing something like that... but I haven't searched for any specific projects.
PSpeed
Looks like there has been at least some work in ML and GATE: http://gate.ac.uk/gate/doc/plugins.html#Machine_Learning
PSpeed
+1  A: 

I built a maximum entropy named entity recognizer for CoNLL data using OpenNLP MaxEnt http://sourceforge.net/projects/maxent/ for a course once.

It required a lot of data preprocessing with custom Perl scripts to get all the features extracted into nice, neat numerical vectors, though.

paul
+2  A: 

I honestly think that the several answers presented here are very good. However, to fulfill my requirements I have chosen to use Apache UIMA with ClearTK. It supports several ML methods and I do not have any licensing problems. Plus, I can write wrappers for other ML methodologies, and I take advantage of the UIMA framework, which is very well organized and fast.

Thank you all for your interesting answers.

Best Regards, ukrania

ukrania