I am looking for a library that, ideally, has the following features:

  • implements hierarchical clustering of multidimensional data (ideally on a similarity or distance matrix)
  • implements support vector machines
  • is in C++
  • is somewhat documented (this one seems to be hardest)

I would prefer C++, as I am most comfortable with that language, but I will use another language if the library is worth it. I have googled and found some candidates, but I do not really have the time to try them all out, so I want to hear what experiences other people have had. Please only answer if you have some experience with the library you recommend.

P.S.: I could also use different libraries for the clustering and the SVM.

+8  A: 

WEKA (http://www.cs.waikato.ac.nz/ml/weka/) is an excellent open source machine learning library that meets most of your requirements except C++: it is written in Java. It is very well documented, implements support vector machines and clustering, and I have had very good experiences with it.

Finbarr
The SVM implementation in Weka is done by the libsvm authors, not the Weka team. Also, Weka was not designed with efficiency in mind; it is only good for playing with small datasets.
Yin Zhu
+6  A: 

There are only a few ML libraries that I have used enough to be comfortable recommending them; dlib ml is certainly one of them.

SourceForge download here, and the bleeding-edge checkout:

svn co https://dclib.svn.sourceforge.net/svnroot/dclib dclib

The original library creator and current maintainer is Davis King.

Your wishlist versus the relevant dlib features:

  • good documentation: for a free, open-source library aimed at a relatively small group of users and developers, this is probably as good as it gets. Aside from the usual docs, refined over the five-year development history, there is a frequently updated Intro to dlib, a (low-traffic) forum, and a large set of excellent examples (including at least one for SVM).

  • C++: 100% C++ as far as I know.

  • Support-Vector Machine algorithm: yes; in fact, the SVM modules have been the focus of the most recent updates to this library.

  • Hierarchical Clustering algorithm: not out of the box; there is, however, packaged code for k-means clustering. Obviously the results of the two techniques are very different, but calculating the similarity metric and the subsequent recursive/iterative partitioning step are at the heart of both. In other words, the computation engine for hierarchical clustering is all there. Adapting the existing clustering module for HC will take more than a couple of lines of code, but it is not a major endeavor either, given that you're working almost at the data-presentation level.
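To give a rough idea of how little machinery hierarchical clustering on a distance matrix needs, here is a minimal, self-contained sketch of naive single-linkage agglomerative clustering. It is not dlib code and does not use any dlib API; the names DistanceMatrix, Merge, single_linkage and cut_tree are hypothetical, chosen only for this illustration, and the O(n^3) search is an assumption made for brevity rather than a recommendation for large datasets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical names for this sketch; none of them come from dlib.
using DistanceMatrix = std::vector<std::vector<double>>;

struct Merge {
    std::size_t a;    // id of the first cluster being merged
    std::size_t b;    // id of the second cluster being merged
    double distance;  // single-linkage distance between the two
};

// Naive single-linkage agglomerative clustering on a symmetric distance matrix.
// Points 0..n-1 start as singleton clusters; the merge at step s creates
// a new cluster with id n + s. Returns the n - 1 merges in the order performed.
std::vector<Merge> single_linkage(DistanceMatrix d) {
    const std::size_t n = d.size();
    for (const auto& row : d) assert(row.size() == n);  // matrix must be square

    std::vector<std::size_t> id(n);      // cluster id currently stored in row i
    std::vector<bool> active(n, true);   // rows that still represent a cluster
    for (std::size_t i = 0; i < n; ++i) id[i] = i;

    std::vector<Merge> merges;
    for (std::size_t step = 0; step + 1 < n; ++step) {
        // Find the closest pair of active clusters.
        double best = std::numeric_limits<double>::infinity();
        std::size_t bi = 0, bj = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (!active[i]) continue;
            for (std::size_t j = i + 1; j < n; ++j) {
                if (active[j] && d[i][j] < best) {
                    best = d[i][j];
                    bi = i;
                    bj = j;
                }
            }
        }
        merges.push_back({id[bi], id[bj], best});

        // Fold row bj into row bi: single linkage keeps the minimum distance.
        for (std::size_t k = 0; k < n; ++k) {
            if (!active[k] || k == bi || k == bj) continue;
            d[bi][k] = d[k][bi] = std::min(d[bi][k], d[bj][k]);
        }
        active[bj] = false;
        id[bi] = n + step;
    }
    return merges;
}

// Flat clustering: replay only the first n - k merges, so k clusters remain.
// The returned labels are cluster ids and are not necessarily contiguous.
std::vector<std::size_t> cut_tree(std::size_t n,
                                  const std::vector<Merge>& merges,
                                  std::size_t k) {
    assert(k >= 1 && k <= n);
    std::vector<std::size_t> label(n);
    for (std::size_t p = 0; p < n; ++p) label[p] = p;
    for (std::size_t s = 0; s < n - k && s < merges.size(); ++s) {
        for (std::size_t p = 0; p < n; ++p) {
            if (label[p] == merges[s].a || label[p] == merges[s].b) {
                label[p] = n + s;
            }
        }
    }
    return label;
}
```

Calling single_linkage on a precomputed distance matrix yields the full merge history (the dendrogram), and cut_tree(n, merges, k) turns it into k flat clusters. Swapping the std::min in the update step for std::max would give complete linkage instead.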

dlib ml has a few additional points to recommend it. First, it's a mature library (it's at version 17.x now; version 1.x was released sometime in late 2005, I believe), yet it remains under active development, as evidenced by the repo logs (the last update, 17.27, was 17 May 2010) and the last commit (23 May 2010). Second, it includes quite a few other ML techniques (e.g., Bayesian networks, kernel methods, etc.). Third, dlib ml has excellent "support" libraries for matrix computation and optimization, both of which are fundamental building blocks of many ML techniques.

In the source, I've noticed that dlib ml is licensed under the BSL (Boost Software License), which is an open source license, though I don't know anything else about this type of license.

doug
+2  A: 

OK, for completeness' sake I will post what I went with in the end. I am using scipy-cluster for the clustering part now. It is the most versatile implementation I have found so far. I think I will go with libSVM (it has a Python interface now) for the SVM part. I am going with Python because there was really no fitting implementation of hierarchical clustering in C++ to be found (the C Clustering Library is specialized for microarrays and does not support multidimensional data).

Space_C0wb0y
+4  A: 

General machine learning libraries:

These two are similar to Weka; however, they are designed with efficiency in mind.

1. Shark (GPL)

http://shark-project.sourceforge.net/

2. Waffles (LGPL)

http://waffles.sourceforge.net/

SVM and other linear classifiers:

1. LibSVM (BSD style)

2. LibLinear (BSD style)

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

All of them are in C++.

Yin Zhu
+1  A: 

It's not C++, but you could consider using R. In particular, have a look at the machine learning task view on CRAN, which lists many of the above libraries, including Weka and libsvm.

griffin