Hi All!

I'm planning to develop a program in Java that will provide a diagnosis. The data set is divided into two parts, one for training and the other for testing. My program should learn to classify from the training data and then make predictions on the testing part :/ The data set contains about 1000 records, one per line. Each record holds the answers to 30 questions, each in its own column, and the last column is the diagnosis (0 or 1). In the testing part of the data the diagnosis column is empty.

I've never done anything similar, so I'd appreciate any advice or information about solutions to similar problems.

I was thinking about the Java Machine Learning Library or the Java Data Mining Package, but I'm not sure if that's the right direction, and I'm still not sure how to tackle this challenge.

Please advise.

All the best!

+6  A: 

There are various algorithms that fall into the category of "machine learning", and which is right for your situation depends on the type of data you're dealing with.

If your data essentially consists of mappings from a set of questions to a set of diagnoses, each of which can be yes/no, then I think methods that could work include neural networks and methods for automatically building a decision tree from the training data.

I'd have a look at some of the standard texts such as Russell & Norvig ("Artificial Intelligence: A Modern Approach") and other introductions to AI/machine learning and see if you can easily adapt the algorithms they mention to your particular data. See also O'Reilly's "Programming Collective Intelligence" for sample Python code for one or two algorithms that might be adaptable to your case.

If you can read Spanish, the Mexican publishing house Alfaomega have also published various good AI-related introductions in recent years.

Neil Coffey
@ Neil Coffey - No knowledge of Spanish :( but I'll check the O'Reilly book. Thank you.
Registered User
+6  A: 

This is a classification problem, not really data mining. The general approach is to extract features from each data instance and let the classification algorithm learn a model from the features and the outcome (which for you is 0 or 1). Presumably each of your 30 questions would be its own feature.
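
To make the "one feature per question" idea concrete, here is a minimal sketch of turning one line of the data file into a feature vector plus an optional label. It assumes comma-separated columns with numeric answers, which the question doesn't actually state; the `Record` class and `parseLine` method are names made up for illustration.

```java
/** Hypothetical holder for one record: 30 answers plus an optional 0/1 diagnosis. */
final class Record {
    static final int NUM_FEATURES = 30;

    final double[] features;
    final Integer label; // null when the diagnosis column is empty (testing data)

    Record(double[] features, Integer label) {
        this.features = features;
        this.label = label;
    }

    /** Parses one comma-separated line (assumed format); the last column may be empty. */
    static Record parseLine(String line) {
        String[] cols = line.split(",", -1); // limit -1 keeps a trailing empty column
        if (cols.length != NUM_FEATURES + 1) {
            throw new IllegalArgumentException(
                "expected " + (NUM_FEATURES + 1) + " columns, got " + cols.length);
        }
        double[] x = new double[NUM_FEATURES];
        for (int i = 0; i < NUM_FEATURES; i++) {
            x[i] = Double.parseDouble(cols[i].trim());
        }
        String last = cols[NUM_FEATURES].trim();
        Integer y = last.isEmpty() ? null : Integer.valueOf(last);
        return new Record(x, y);
    }
}
```

Training records come back with a label of 0 or 1, and records from the testing part come back with a `null` label for the classifier to fill in.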

There are many classification techniques you can use. Support vector machines are popular, as is maximum entropy. I haven't used the Java Machine Learning Library, but at a glance I don't see either of these in it. The OpenNLP project has a maximum entropy implementation, and LibSVM has a support vector machine implementation. You'll almost certainly have to convert your data into a format that the library can understand.
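
As an illustration of that conversion step, LibSVM reads plain-text lines of the form `<label> <index>:<value> ...` with 1-based feature indices. A rough sketch of writing one record in that shape follows; the `LibSvmFormat` class and `toLine` method are hypothetical helpers, not part of LibSVM.

```java
/** Hypothetical helper that writes a labelled record in LibSVM's sparse text format. */
final class LibSvmFormat {
    /** Returns "<label> 1:<v1> 2:<v2> ..." with 1-based indices, skipping zero values. */
    static String toLine(int label, double[] features) {
        StringBuilder sb = new StringBuilder();
        sb.append(label);
        for (int i = 0; i < features.length; i++) {
            if (features[i] != 0.0) { // the format is sparse, so zeros may be omitted
                sb.append(' ').append(i + 1).append(':').append(features[i]);
            }
        }
        return sb.toString();
    }
}
```

For example, `LibSvmFormat.toLine(1, new double[] {1.0, 0.0, 2.5})` gives `1 1:1.0 3:2.5`.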

Good luck!

Update: I agree with the other commenter that Russell and Norvig is a great AI book which discusses some of this. Bishop's "Pattern Recognition and Machine Learning" discusses classification issues in depth if you're interested in the down and dirty details.

Gann Bierner
Thanks, I've got a copy of this book, it's awesome indeed!
Registered User
+13  A: 

I strongly recommend you use Weka for your task.
It's a collection of machine learning algorithms with a user-friendly front end that supports many different feature and model selection strategies.
You can do a lot of really complicated stuff with it without having to do any coding or math.
The makers have also published a pretty good textbook that explains the practical aspects of data mining.
Once you get the hang of it, you can use its API to integrate any of its classifiers into your own Java programs.

adi92
Thank you, I'll have a look at Weka.
Registered User
The software and textbook are really good for getting your head around machine learning, I highly recommend them.
gverdouw
+1 for Weka. Another good toolkit is *RapidMiner*
Amro
+5  A: 

Hi. As Gann Bierner said, this is a classification problem. The best classification algorithm for your needs that I know of is Ross Quinlan's decision tree algorithm (ID3, later extended as C4.5). It's conceptually very easy to understand.
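
The core idea in Quinlan's ID3 is to split at each node on the question whose answer gives the largest information gain, i.e. the largest drop in entropy of the diagnosis. A minimal sketch of that calculation for yes/no features is below; it assumes answers and diagnoses are encoded as 0/1, and the `InfoGain` class is a name invented for this example (it is not jaDTi's or Weka's API).

```java
/** Hypothetical helper: entropy and information gain for 0/1 features and 0/1 labels. */
final class InfoGain {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Entropy in bits of a set with pos positive and neg negative examples. */
    static double entropy(int pos, int neg) {
        int n = pos + neg;
        if (pos == 0 || neg == 0) return 0.0; // a pure (or empty) set has no uncertainty
        double p = (double) pos / n;
        double q = (double) neg / n;
        return -(p * log2(p) + q * log2(q));
    }

    /** Information gain from splitting the labels on one 0/1 feature column. */
    static double gain(int[] feature, int[] labels) {
        int n = labels.length;
        if (n == 0) return 0.0;
        int[][] counts = new int[2][2]; // counts[featureValue][label]
        for (int i = 0; i < n; i++) counts[feature[i]][labels[i]]++;
        double before = entropy(counts[0][1] + counts[1][1], counts[0][0] + counts[1][0]);
        double after = 0.0;
        for (int v = 0; v < 2; v++) {
            int nv = counts[v][0] + counts[v][1];
            after += (double) nv / n * entropy(counts[v][1], counts[v][0]);
        }
        return before - after;
    }

    /** Index of the feature column with the highest gain (first one wins on ties). */
    static int bestFeature(int[][] columns, int[] labels) {
        int best = 0;
        for (int f = 1; f < columns.length; f++) {
            if (gain(columns[f], labels) > gain(columns[best], labels)) best = f;
        }
        return best;
    }
}
```

A tree builder would call `bestFeature` on the records reaching a node, split on that question, and repeat on each branch until the records are all one diagnosis. C4.5 refines the criterion (it uses gain ratio), which this sketch does not attempt.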

For off-the-shelf implementations of classification algorithms, the best bet is Weka: http://www.cs.waikato.ac.nz/ml/weka/. I have studied Weka but not used it, as I discovered it a little too late.

I used a much simpler implementation called jaDTi. It works pretty well for smaller data sets such as yours. I have used it quite a bit, so I can say that with confidence. jaDTi can be found at:

http://www.run.montefiore.ulg.ac.be/~francois/software/jaDTi/

Having said all that, your challenge will be building a usable interface over the web. For that, the data set will be of limited use. The data set approach basically works on the premise that you already have the training set, you feed in the new test data set in one step, and you get the answer(s) immediately.

But my application, and probably yours too, was a step-by-step user discovery, with features to go back and forth between the decision tree nodes.

To build such an application, I created a PMML document from my training set and built a Java engine that traverses each node of the tree, asking the user for an input (text/radio/list) and using the values as inputs to the next possible node predicate.
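
A rough sketch of that kind of engine, reduced to yes/no questions: the `TreeNode` class is hypothetical, the tree is built by hand rather than parsed from a PMML TreeModel, and the "user" is just a `Predicate` so it can be driven by a UI or a test.

```java
import java.util.function.Predicate;

/** Hypothetical in-memory decision tree node; a real engine would build these from PMML. */
final class TreeNode {
    final String question;  // null for a leaf
    final String diagnosis; // null for an inner node
    final TreeNode yes;
    final TreeNode no;

    private TreeNode(String question, String diagnosis, TreeNode yes, TreeNode no) {
        this.question = question;
        this.diagnosis = diagnosis;
        this.yes = yes;
        this.no = no;
    }

    static TreeNode leaf(String diagnosis) {
        return new TreeNode(null, diagnosis, null, null);
    }

    static TreeNode ask(String question, TreeNode yes, TreeNode no) {
        return new TreeNode(question, null, yes, no);
    }

    boolean isLeaf() {
        return question == null;
    }

    /** Walks from this node down to a leaf, asking one question per step. */
    String walk(Predicate<String> askUser) {
        TreeNode node = this;
        while (!node.isLeaf()) {
            // The answer to the current question selects the next node to visit.
            node = askUser.test(node.question) ? node.yes : node.no;
        }
        return node.diagnosis;
    }
}
```

In a web interface the loop would be spread over several requests, with the current node kept in the session. Going back a step could be handled by pushing visited nodes onto a stack, which this sketch leaves out.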

The PMML standard can be found here: http://www.dmg.org/. You need only the TreeModel. The NetBeans XML plugin is a good schema-aware editor for PMML authoring. Altova XML can do a better job, but costs $$.

It is also possible to use an RDBMS to store your dataset and create the PMML automagically! I have not tried that.

Good luck with your project, please feel free to let me know if you need further inputs.

srini.venigalla
How can you so unequivocally state that decision trees are the best algorithm for the task at hand?
Steve Lianoglou
I said, "I know of", right? What do you suggest?
srini.venigalla
You're right, sorry. I guess I'd first try just running it through an SVM, since that's quick and easy to do (e.g. just put the data in a format libsvm understands and run it through) and usually gives great performance relative to the amount of work you have to do to get it working. You could also try boosting, naive Bayes, (penalized) logistic regression (check out "glmnet" w/ related reading) ... I'd be hard pressed to pick one as "the best," though.
Steve Lianoglou
@srini.venigalla Thank you for your input!
Registered User
+3  A: 

Your task is a classic one for neural networks, which are intended first of all to solve exactly this kind of classification task. A neural network is fairly simple to implement in any language, and it is the "mainstream" of machine learning, closer to AI than anything else. You just implement (or take an existing implementation of) a standard neural network, for example a multilayer network trained by error backpropagation, and feed it the training examples in a loop. After some time of such training you will have it working on real examples.

You can read more about neural networks starting from here: http://en.wikipedia.org/wiki/Neural%5Fnetwork http://en.wikipedia.org/wiki/Artificial%5Fneural%5Fnetwork

You can also find links to many ready-made implementations here: http://en.wikipedia.org/wiki/Neural%5Fnetwork%5Fsoftware
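
To give a feel for how little code that takes, here is a minimal sketch of a network with one hidden layer trained by online backpropagation. The class name `TinyMlp`, the sigmoid activations, the squared-error loss and the learning rate are all assumptions chosen for brevity; a real diagnosis program would need input scaling, a held-out set, and some tuning.

```java
import java.util.Random;

/** Hypothetical minimal multilayer network: nIn inputs, nHid sigmoid units, one sigmoid output. */
final class TinyMlp {
    final int nIn;
    final int nHid;
    final double[][] w1;   // [nHid][nIn + 1]; the last column is the bias weight
    final double[] w2;     // [nHid + 1]; the last entry is the bias weight
    final double[] hidden; // hidden activations cached by the latest forward pass

    TinyMlp(int nIn, int nHid, long seed) {
        this.nIn = nIn;
        this.nHid = nHid;
        this.w1 = new double[nHid][nIn + 1];
        this.w2 = new double[nHid + 1];
        this.hidden = new double[nHid];
        Random rnd = new Random(seed);
        // Small random initial weights in [-0.5, 0.5)
        for (double[] row : w1) {
            for (int i = 0; i < row.length; i++) row[i] = rnd.nextDouble() - 0.5;
        }
        for (int j = 0; j < w2.length; j++) w2[j] = rnd.nextDouble() - 0.5;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Forward pass; returns a value in (0, 1), read as the score for diagnosis 1. */
    double predict(double[] x) {
        double out = w2[nHid];
        for (int j = 0; j < nHid; j++) {
            double z = w1[j][nIn];
            for (int i = 0; i < nIn; i++) z += w1[j][i] * x[i];
            hidden[j] = sigmoid(z);
            out += w2[j] * hidden[j];
        }
        return sigmoid(out);
    }

    /** One backpropagation step on a single example, using squared-error loss. */
    void trainOne(double[] x, double target, double learningRate) {
        double y = predict(x);
        double deltaOut = (y - target) * y * (1 - y);
        for (int j = 0; j < nHid; j++) {
            // Hidden error term is computed with the output weight before it is updated.
            double deltaHid = deltaOut * w2[j] * hidden[j] * (1 - hidden[j]);
            w2[j] -= learningRate * deltaOut * hidden[j];
            for (int i = 0; i < nIn; i++) w1[j][i] -= learningRate * deltaHid * x[i];
            w1[j][nIn] -= learningRate * deltaHid;
        }
        w2[nHid] -= learningRate * deltaOut;
    }

    /** Mean squared error over a data set, handy for watching training progress. */
    double meanSquaredError(double[][] xs, double[] targets) {
        double sum = 0.0;
        for (int k = 0; k < xs.length; k++) {
            double e = predict(xs[k]) - targets[k];
            sum += e * e;
        }
        return sum / xs.length;
    }
}
```

For the data in the question, something like `new TinyMlp(30, 8, 42L)` would match the 30 questions (the 8 hidden units and the seed are arbitrary picks). You would call `trainOne` over the training records for a number of passes, then treat `predict(x) >= 0.5` as diagnosis 1.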