I'm using the Explorer feature of Weka for classification.

I have an .arff file with two NUMERIC features and a binary class, 0 or 1 (e.g. {0,1}).

Sample:

@RELATION summary
@ATTRIBUTE feature1 NUMERIC
@ATTRIBUTE feature2 NUMERIC
@ATTRIBUTE class {1,0}

@DATA
23,11,0
20,100,1
2,36,0
98,8,1
.....

I load this .arff file, choose 10-fold cross-validation (no separate test file), select NaiveBayes, and classify the data. It gives me 5 incorrectly labeled and 100 correctly labeled instances. So far so good.

Now I change my .arff file significantly (giving completely random values to my feature attributes) and repeat the steps above, and I get the EXACT same statistics when classifying.

I tried this with more changes to my .arff file and with different classification algorithms. Still, I get the EXACT same statistics (within the same algorithm) no matter what values I put in my .arff file.

Am I doing something wrong here?

+2  A: 

It's hard to tell without more information, but I have two suggestions:

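  1. What are the relative proportions of the two classes? Is it 5 to 100? Lots of algorithms don't work well with highly skewed class label distributions. If the split really is 5 to 100, a classifier that always predicts the majority class gets exactly 100 right and 5 wrong no matter what the feature values are, which would match what you're seeing.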

  2. Just a hunch, but try changing your class labels from numbers to strings (e.g. 'class1' and 'class2'). Weka treats the class as a 'nominal' attribute, so maybe numeric-looking labels are causing trouble.
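
To check the first suggestion, a rough sketch like the one below could count the class labels and compute the accuracy of always guessing the majority class. It is plain Python, not part of Weka. The function names are made up for illustration, and it assumes a dense .arff file with the class as the last comma-separated field.

```python
from collections import Counter

def class_distribution(arff_text):
    """Count class labels, assuming the class is the last field of each data row."""
    counts = Counter()
    in_data = False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if line.lower() == "@data":
            in_data = True
            continue
        if in_data:
            counts[line.rsplit(",", 1)[-1].strip()] += 1
    return counts

def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0

# The four rows shown in the question (the real file has more).
sample = """@RELATION summary
@ATTRIBUTE feature1 NUMERIC
@ATTRIBUTE feature2 NUMERIC
@ATTRIBUTE class {1,0}

@DATA
23,11,0
20,100,1
2,36,0
98,8,1
"""

counts = class_distribution(sample)
print(dict(counts), majority_baseline(counts))
```

If the baseline on the full file comes out at 100/105, the "5 incorrect, 100 correct" result says very little about the features.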

StompChicken
+1  A: 

Also: keep in mind that cross-validation is pretty horrid in the UI, as it only shows you the model built on the full data set, not the models trained on the individual folds. If you want the per-fold models, you need the programmatic API. I suggest using a split training/test data set.
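
One way to get such a split is to cut the .arff file in two before loading it. The sketch below is plain Python, not Weka code. `split_arff` and its parameters are hypothetical names, the split is a simple seeded shuffle (not stratified by class), and it assumes a dense file with one instance per line after @DATA.

```python
import random

def split_arff(arff_text, train_fraction=0.66, seed=1):
    """Return (train_text, test_text); both keep the original header."""
    header, rows = [], []
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if in_data:
            # keep only real data rows, dropping blanks and comments
            if stripped and not stripped.startswith("%"):
                rows.append(stripped)
        else:
            header.append(line)
            if stripped.lower() == "@data":
                in_data = True
    random.Random(seed).shuffle(rows)  # seeded so the split is repeatable
    cut = int(len(rows) * train_fraction)

    def build(part):
        return "\n".join(header + part) + "\n"

    return build(rows[:cut]), build(rows[cut:])

# The four rows shown in the question (the real file has more).
sample = """@RELATION summary
@ATTRIBUTE feature1 NUMERIC
@ATTRIBUTE feature2 NUMERIC
@ATTRIBUTE class {1,0}

@DATA
23,11,0
20,100,1
2,36,0
98,8,1
"""

train_text, test_text = split_arff(sample, train_fraction=0.5)
print(train_text)
print(test_text)
```

The two outputs can be saved as separate files, with the second one supplied as the test set instead of using cross-validation.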

James
A: 

Have you tried replacing

@ATTRIBUTE class {1,0} 

with

@ATTRIBUTE class {yes,no}
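
Note that the labels in the data rows would have to change as well, so that they match the new header. A small sketch of that rewrite is below. It is plain Python, not a Weka tool. `relabel_class` is a made-up helper, and it assumes a dense file whose class is the last field and is declared on a line starting with `@ATTRIBUTE class`.

```python
def relabel_class(arff_text, mapping, class_attr="class"):
    """Rename nominal class labels in both the header and the data rows."""
    out = []
    in_data = False
    prefix = "@attribute " + class_attr.lower() + " "
    for line in arff_text.splitlines():
        stripped = line.strip()
        if in_data and "," in stripped and not stripped.startswith("%"):
            # data row: swap the last field for its new label
            values, label = stripped.rsplit(",", 1)
            label = label.strip()
            out.append(values + "," + mapping.get(label, label))
        elif stripped.lower().startswith(prefix) and "{" in stripped:
            # class declaration: rewrite the label set inside the braces
            inner = stripped[stripped.index("{") + 1 : stripped.rindex("}")]
            labels = [item.strip() for item in inner.split(",")]
            renamed = ",".join(mapping.get(item, item) for item in labels)
            out.append("@ATTRIBUTE " + class_attr + " {" + renamed + "}")
        else:
            out.append(line)
            if stripped.lower() == "@data":
                in_data = True
    return "\n".join(out) + "\n"

# The four rows shown in the question (the real file has more).
sample = """@RELATION summary
@ATTRIBUTE feature1 NUMERIC
@ATTRIBUTE feature2 NUMERIC
@ATTRIBUTE class {1,0}

@DATA
23,11,0
20,100,1
2,36,0
98,8,1
"""

converted = relabel_class(sample, {"1": "yes", "0": "no"})
print(converted)
```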