This question is about LibSVM.

In 'A Practical Guide to Support Vector Classification' it is suggested to use m numbers to represent an m-category attribute. For example, {red, green, blue} can be represented as (0,0,1), (0,1,0), (1,0,0).

But the README says that the value in each index:value pair can only be a real number.

Does anyone know how to represent, say, (0,0,1) in the data file?

A: 

This is not necessary for libsvm, because it uses the one-against-one method of training multi-class SVMs (in fact, the documentation refers you to a research article in which one-against-one is compared with one-against-all and performs better). If you have 4 categories a, b, c, d, libsvm actually creates 6 SVMs internally: a versus b, a versus c, a versus d, b versus c, b versus d, and c versus d. When asked to classify, it runs all 6 and uses a voting system to determine the winning category. This is actually better than just using (1,0,0,...), (0,1,0,...) category inputs.
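To make the pair counting and voting concrete, here is a minimal Python sketch. It is not libsvm's code: the function names `one_vs_one_pairs` and `vote` are made up for illustration, the list of pairwise winners is invented, and tie-breaking is not modeled.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """Return every unordered pair of classes; one binary SVM per pair."""
    return list(combinations(classes, 2))


def vote(pairwise_winners):
    """Pick the class that won the most pairwise contests (ties not handled)."""
    return Counter(pairwise_winners).most_common(1)[0][0]


pairs = one_vs_one_pairs(["a", "b", "c", "d"])
print(len(pairs), pairs)  # 4 classes give 4*3/2 = 6 pairs

# Hypothetical winners of the 6 contests, in the same order as `pairs`.
winners = ["a", "a", "d", "b", "d", "d"]
print(vote(winners))
```

In general, k classes give k(k-1)/2 pairwise classifiers, which is where the 6 comes from for k = 4.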

If you look at the sample data sets in the libsvm data examples, you will see that each category is assigned an integer, and that integer goes at the front of the entry, followed by each element of the vector of values for that data element. For example, if I have data in 5 classes, each data element is a vector of 5 values, and the data vector (3.3, 1.5, 0.5, 7.3, 3.5) belongs to class 4, a line of my data file would look like

4 1:3.3 2:1.5 3:0.5 4:7.3 5:3.5

This is really ugly, but I think it is because they use a convention where zero vector entries are dropped. For example, if the vector (.5,0,0,0,.7) were in category 2, the corresponding data line would be (I think)

2 1:.5 5:.7

The value of that (if I am correct) is that in some problems with very large amounts of data, the vast bulk of the entries are zero, so dropping them keeps the file small.
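As a rough illustration of this format, the following Python sketch writes a label and a vector as one data line, dropping zero entries. The helper name `to_libsvm_line` is hypothetical and not part of libsvm; it assumes indices start at 1, as in the lines above. Under the same convention, the (0,0,1) vector from the question, if it occupied indices 1 to 3, would reduce to the single pair 3:1 (the label 1 used below is just a placeholder).

```python
def to_libsvm_line(label, values):
    """Format one example as 'label index:value ...', skipping zero entries.

    Indices are 1-based, matching the data lines shown above.
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(values, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# The two examples from the answer.
print(to_libsvm_line(4, [3.3, 1.5, 0.5, 7.3, 3.5]))
print(to_libsvm_line(2, [0.5, 0, 0, 0, 0.7]))

# The question's (0,0,1) encoding of a 3-category attribute,
# with a placeholder class label of 1.
print(to_libsvm_line(1, [0, 0, 1]))
```

Note that this sketch prints a leading zero (0.5 rather than .5); both spell the same real number.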

John Robertson