The last tool I used was Weka. The last I heard, Java was getting a data mining API (JDM). Can anyone share their experiences with these tools? I am mostly interested in using them for classification and clustering (Weka does a decent job here), and the tool should have good API support.
We used Weka in some software we developed for classification and clustering. I'm no expert on data mining, but the team that evaluated it along with a number of other products certainly knows its stuff, and is generally used to working with very expensive off-the-shelf products.
I am using RapidMiner (formerly YALE, from the University of Dortmund). It's a Java-based open source tool that implements most of the popular classification and clustering methods. It also ships with the algorithms implemented for the Weka toolkit, so there are more options there. It comes with a GUI that is quite easy to use, and a Java-based API.
Weka is a popular data-mining platform, with a number of textbook algorithms implemented for classification, clustering, etc. It is great for rapid prototyping, i.e. quickly setting up a system and validating that it does what it was intended for.
There are two main issues with Weka, however. The first is that it is distributed under the GPL, which means that if you distribute software that includes or modifies it, you have to release that software under the GPL as well, source code included. That makes it a poor fit for a closed-source commercial package. The second is that Weka doesn't handle large amounts of data well: most of its algorithms expect the whole dataset in memory, so if your data cannot fit in the memory of your computer, you have a problem.
Both of these issues are addressed by the Apache Mahout package: it is released under the Apache License and is built to scale to large datasets. It is relatively new and lacks some functionality, but depending on the data mining problems you have, it may be the right choice for you.
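To make the memory limit mentioned above a bit more concrete, here is a rough sketch of clustering data that is only ever seen one record at a time. It uses sequential (online) k-means in plain Python. This is not how Weka or Mahout implement anything; `stream_points` and `online_kmeans` are made-up names for illustration, and the generator merely stands in for reading a file that is too large to load.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def stream_points(n: int, seed: int = 0) -> Iterator[List[float]]:
    """Hypothetical data source: yields one 2-D point at a time,
    standing in for rows read lazily from a very large file."""
    rng = random.Random(seed)
    centres = [(0.0, 0.0), (10.0, 10.0)]
    for i in range(n):
        cx, cy = centres[i % 2]  # alternate between two synthetic clusters
        yield [rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)]


def online_kmeans(points: Iterable[Sequence[float]], k: int) -> List[List[float]]:
    """Sequential k-means: a single pass over the stream, keeping only
    k centroids and k counters in memory (never the full dataset)."""
    centroids: List[List[float]] = []
    counts: List[int] = []
    for p in points:
        if len(centroids) < k:
            # Assumption: seed the centroids with the first k points.
            centroids.append(list(p))
            counts.append(1)
            continue
        # Index of the centroid closest to p (squared Euclidean distance).
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], p)),
        )
        # Move that centroid towards p so it stays the running mean
        # of all points assigned to it so far.
        counts[j] += 1
        step = 1.0 / counts[j]
        centroids[j] = [a + step * (b - a) for a, b in zip(centroids[j], p)]
    return centroids


centroids = sorted(online_kmeans(stream_points(10_000), k=2))
print(centroids)
```

The result depends on the order of the stream and on the seeding, so treat it as an illustration of the one-pass idea rather than a replacement for a proper library.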
According to the yearly KDnuggets polls of 2007, 2008, and 2009, RapidMiner is the most widely used open source data mining solution among data mining experts worldwide: KDnuggets Data Mining Tool Poll 2009
RapidMiner is open source and 100% Java. It is much more flexible and offers significantly more functionality than Weka.
You really should check out the Orange data mining toolkit. It comes with a drag-and-drop GUI as well as a Python API.