ansaurus

Question

Calculating Nearest Match to Mean/Stddev Pair With LibSVM

Answer 1

+2 A:

If you are interested in a different way of doing this, you could do the following. This way is theoretically more sound, though not as straightforward.

By mentioning mean and std, it seems as if you refer to data that you assume to be distributed in some way, e.g., the data you observe is Gaussian distributed. You can then use the symmetrised Kullback-Leibler divergence as a distance measure between those distributions, and something like k-nearest neighbour to classify.

For two probability densities p and q, you have KL(p, q) = 0 only if p and q are the same. However, KL is not symmetric, so in order to have a symmetric distance measure, you can use

distance(p1, p2) = KL(p1, p2) + KL(p2, p1)

For Gaussians, KL(p1, p2) = { (μ1 - μ2)^2 + σ1^2 - σ2^2 } / (2 σ2^2) + ln(σ2/σ1). (I stole that from here, where you can also find a derivation :)

Long story short:

Given a training set D of (mean, std, class) tuples and a new p = (mean, std) pair, find the q in D for which distance(q, p) is minimal and return the class of that q.
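
As a minimal sketch of that recipe, assuming one-dimensional Gaussians and a plain 1-nearest-neighbour lookup: the function names (kl_gaussian, distance, classify) and the training tuples below are made up for illustration and are not part of LibSVM or of the question's code.

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL(p1, p2) for two 1-D Gaussians, using the closed form quoted above.

    Both standard deviations must be strictly positive.
    """
    return (((mu1 - mu2) ** 2 + sigma1 ** 2 - sigma2 ** 2) / (2.0 * sigma2 ** 2)
            + math.log(sigma2 / sigma1))

def distance(p, q):
    """Symmetrised KL divergence between two (mean, std) pairs."""
    return (kl_gaussian(p[0], p[1], q[0], q[1])
            + kl_gaussian(q[0], q[1], p[0], p[1]))

def classify(training_set, p):
    """Return the class of the (mean, std, class) tuple closest to p."""
    nearest = min(training_set, key=lambda t: distance((t[0], t[1]), p))
    return nearest[2]

# Hypothetical training data: (mean, std, class)
training_set = [(0.0, 1.0, "a"), (5.0, 1.0, "b"), (0.0, 4.0, "c")]
label = classify(training_set, (0.3, 1.2))
```

Extending this to k > 1 neighbours would mean sorting D by distance and taking a majority vote over the first k entries.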

To me that feels better than the SVM approach with several kernels, since the way of classifying is not so arbitrary.

bayer 2010-04-02 21:44:19

Thanks. I figured there was probably something better than an SVM for normal/Gaussian distributions. However, I also intend to combine these Gaussian features with other arbitrary features, so k-NN using a specialized distance measure would not be appropriate.

Chris S 2010-04-03 00:26:29

There are actually ways to learn such distance measures from class labels. Maybe you want to check out Sam Roweis' work on Neighbourhood Components Analysis.

bayer 2010-04-06 12:44:31

Answer 2

+3 A:

The problem seems to be coming from combining multiclass prediction with probability estimates.

If you configure your code not to make probability estimates, it actually works, e.g.:

<snip>
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
  print kt
  # Changed here: 'probability = 1' removed from the parameters.
  param = svm_parameter(kernel_type = kt, C=10)
  model = svm_model(problem, param)
  for test_sample,correct_label in test:
    # Changed here: call predict instead of predict_probability.
    pred_label = model.predict(test_sample)
    correct[kn] += pred_label == correct_label
</snip>

With this change, I get:

--------------------------------------------------------------------------------
Accuracy:
        polynomial 1.000000 (4 of 4)
        rbf 1.000000 (4 of 4)
        linear 1.000000 (4 of 4)

Prediction with probability estimates does work if you double up the data in the training set (i.e., include each data point twice). However, I couldn't find any way to parametrize the model so that multiclass prediction with probabilities would work with just the original four training points.

dmcer 2010-04-03 04:25:06
