I'm new to SVMs, and I'm trying to use the Python interface to libsvm to classify a sample containing a mean and stddev. However, I'm getting nonsensical results.

Is this task inappropriate for SVMs or is there an error in my use of libsvm? Below is the simple Python script I'm using to test:

#!/usr/bin/env python
# Simple classifier test.
# Adapted from the svm_test.py file included in the standard libsvm distribution.
from collections import defaultdict
from svm import *
# Define our sparse data formatted training and testing sets.
labels = [1,2,3,4]
train = [ # key: 0=mean, 1=stddev
    {0:2.5,1:3.5},
    {0:5,1:1.2},
    {0:7,1:3.3},
    {0:10.3,1:0.3},
]
problem = svm_problem(labels, train)
test = [
    ({0:3, 1:3.11},1),
    ({0:7.3,1:3.1},3),
    ({0:7,1:3.3},3),
    ({0:9.8,1:0.5},4),
]

# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
    print kt
    param = svm_parameter(kernel_type = kt, C=10, probability = 1)
    model = svm_model(problem, param)
    for test_sample,correct_label in test:
        pred_label, pred_probability = model.predict_probability(test_sample)
        correct[kn] += pred_label == correct_label

# Show results.
print '-'*80
print 'Accuracy:'
for kn,correct_count in correct.iteritems():
    print '\t',kn, '%.6f (%i of %i)' % (correct_count/float(len(test)), correct_count, len(test))

The domain seems fairly simple. I'd expect that if it's trained to know a mean of 2.5 means label 1, then when it sees a mean of 2.4, it should return label 1 as the most likely classification. However, each kernel has an accuracy of 0%. Why is this?

A couple of side notes: is there a way to hide all the verbose training output that libsvm dumps to the terminal? I've searched libsvm's docs and code, but I can't find any way to turn this off.

Also, I had wanted to use simple strings as the keys in my sparse dataset (e.g. {'mean':2.5,'stddev':3.5}). Unfortunately, libsvm only supports integers. I tried using the long integer representation of the string (e.g. 'mean' == 1109110110971110), but libsvm seems to truncate these to normal 32-bit integers. The only workaround I see is to maintain a separate "key" file that maps each string to an integer ('mean'=0, 'stddev'=1). But obviously that'll be a pain since I'll have to maintain and persist a second file along with the serialized classifier. Does anyone see an easier way?
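To make the workaround concrete, here is a rough sketch of the kind of mapping I mean. FeatureIndex is a made-up helper for illustration only, not part of libsvm, and it does not solve the persistence problem by itself:

```python
class FeatureIndex(object):
    """Hypothetical helper: assigns a small integer id to each string key."""

    def __init__(self):
        self.name_to_id = {}

    def encode(self, sample):
        """Turn {'mean': 2.5, ...} into the {0: 2.5, ...} form used above."""
        encoded = {}
        # Sort the keys so that ids are assigned in a reproducible order.
        for name, value in sorted(sample.items()):
            if name not in self.name_to_id:
                self.name_to_id[name] = len(self.name_to_id)
            encoded[self.name_to_id[name]] = value
        return encoded


index = FeatureIndex()
encoded_sample = index.encode({'mean': 2.5, 'stddev': 3.5})
```

The name_to_id dictionary is exactly the second piece of state that would have to be saved next to the serialized classifier.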

+2  A: 

If you are interested in a different way of doing this, you could do the following. It is theoretically more sound, though not as straightforward.

Since you mention mean and std, it seems you assume your data follows some distribution, e.g. that the data you observe is Gaussian distributed. You can then use the symmetrised Kullback-Leibler divergence as a distance measure between those distributions, and use something like k-nearest neighbour to classify.

For two probability densities p and q, you have KL(p, q) = 0 only if p and q are the same. However, KL is not symmetric - so in order to have a proper distance measure, you can use

distance(p1, p2) = KL(p1, p2) + KL(p2, p1)

For Gaussians, KL(p1, p2) = { (μ1 - μ2)^2 + σ1^2 - σ2^2 } / (2 σ2^2) + ln(σ2/σ1). (I stole that from here, where you can also find a derivation :)

Long story short:

Given a training set D of (mean, std, class) tuples and a new p = (mean, std) pair, find the q in D for which distance(q, p) is minimal and return the class of that q.
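A minimal sketch of that procedure in plain Python, assuming univariate Gaussians and a 1-nearest-neighbour rule. The function names kl_gauss, distance and classify are made up for this example, and the data is taken from the question:

```python
import math


def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL(p1, p2) for two univariate Gaussians, using the formula above."""
    return (((mu1 - mu2) ** 2 + sigma1 ** 2 - sigma2 ** 2) / (2.0 * sigma2 ** 2)
            + math.log(sigma2 / float(sigma1)))


def distance(p1, p2):
    """Symmetrised KL divergence between two (mean, std) pairs."""
    return (kl_gauss(p1[0], p1[1], p2[0], p2[1])
            + kl_gauss(p2[0], p2[1], p1[0], p1[1]))


def classify(training_set, p):
    """Return the class of the (mean, std, class) tuple closest to p."""
    nearest = min(training_set, key=lambda q: distance((q[0], q[1]), p))
    return nearest[2]


# Training and test data from the question, as (mean, std, class) tuples.
D = [(2.5, 3.5, 1), (5, 1.2, 2), (7, 3.3, 3), (10.3, 0.3, 4)]
tests = [((3, 3.11), 1), ((7.3, 3.1), 3), ((7, 3.3), 3), ((9.8, 0.5), 4)]

predictions = [classify(D, p) for p, _ in tests]
print(predictions)
```

On the four test points from the question, this picks the expected class each time.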

To me that feels better than the SVM approach with several kernels, since the way of classifying is not so arbitrary.

bayer
Thanks. I figured there was probably something better than an SVM for normal/Gaussian distributions. However, I also intend to combine these Gaussian features with other arbitrary features, so k-NN using a specialized distance measure would not be appropriate.
Chris S
There are actually ways to learn such distance measures from class labels. Maybe you want to check out Sam Roweis' work on Neighbourhood Components Analysis.
bayer
+3  A: 

The problem seems to be coming from combining multiclass prediction with probability estimates.

If you configure your code not to make probability estimates, it actually works, e.g.:

<snip>
# Test classifiers.
kernels = [LINEAR, POLY, RBF]
kname = ['linear','polynomial','rbf']
correct = defaultdict(int)
for kn,kt in zip(kname,kernels):
    print kt
    # Changed: probability = 1 removed from the parameters.
    param = svm_parameter(kernel_type = kt, C=10)
    model = svm_model(problem, param)
    for test_sample,correct_label in test:
        # Changed: predict instead of predict_probability.
        pred_label = model.predict(test_sample)
        correct[kn] += pred_label == correct_label
</snip>

With this change, I get:

--------------------------------------------------------------------------------
Accuracy:
        polynomial 1.000000 (4 of 4)
        rbf 1.000000 (4 of 4)
        linear 1.000000 (4 of 4)

Prediction with probability estimates does work if you double up the data in the training set (i.e., include each data point twice). However, I couldn't find any way to parametrize the model so that multiclass prediction with probabilities would work with just the original four training points.
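A minimal sketch of the doubling step, in plain Python without touching libsvm. The helper name duplicate_training_set is made up for this example. As far as I understand, libsvm fits its probability model using internal cross-validation on the training data, which is probably why one sample per class is not enough, but treat that explanation as an assumption:

```python
def duplicate_training_set(labels, samples, copies=2):
    """Return the labels and sparse samples repeated `copies` times, in step."""
    doubled_labels = list(labels) * copies
    doubled_samples = [dict(s) for _ in range(copies) for s in samples]
    return doubled_labels, doubled_samples


# Training data from the question.
labels = [1, 2, 3, 4]
train = [
    {0: 2.5, 1: 3.5},
    {0: 5, 1: 1.2},
    {0: 7, 1: 3.3},
    {0: 10.3, 1: 0.3},
]

labels2, train2 = duplicate_training_set(labels, train)
```

The doubled lists would then be passed to svm_problem(labels2, train2) in place of the originals, with probability = 1 left in the parameters.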

dmcer