views:

204

answers:

2

I'm trying to classify an example, which contains discrete and continuous features. Also, the example represents sparse data, so even though the system may have been trained on 100 features, the example may only have 12.

What would be the best classifier algorithm to use to accomplish this? I've been looking at Bayes, Maxent, Decision Tree, and KNN, but I'm not sure any fit the bill exactly. The biggest sticking point I've found is that most implementations don't support sparse data sets and both discrete and continuous features. Can anyone recommend an algorithm and implementation (preferably in Python) that fits these criteria?

Libraries I've looked at so far include:

  1. Orange (Mostly academic. Implementations not terribly efficient or practical.)
  2. NLTK (Also academic, although has a good Maxent implementation, but doesn't handle continuous features.)
  3. Weka (Still researching this. Seems to support a broad range of algorithms, but has poor documentation, so it's unclear what each implementation supports.)
+2  A: 

Support vector machines? libsvm can be used from Python, and is quite speedy.

Handles sparse vector inputs, and won't mind if some of the features are continuous, where others are just -1/+1. (If you've got an n-way discrete feature, the standard thing to do is expand it into n binary features.)

Jay Kominek
Interesting. Although I've heard of them before, I don't have much experience with SVMs. However, can't it be difficult to find an appropriate kernel?
Chris S
I'm finding libsvm to be severely lacking in documentation, and there's no community forum. If it supports sparse data sets, the feature is very well hidden. The *single* Python example included in the distribution uses a dense data set, even though the other training files appear to be formatted in a sparse style.
Chris S
+2  A: 

Weka (Java) satisfies all you requirements:

Check out this Pentaho wiki for a list of links to documentations, guides, video tutorials, etc ...

Amro
Adding to this, there is a Python binding for Weka implemented in NLTK.
ferdystschenko