I have a binary-class dataset (0 / 1) with a large skew towards the "0" class (about 30000 vs 1500). Each instance has 7 features, with no missing values.

When I use J48 or any other tree classifier, almost all of the "1" instances are misclassified as "0".

Setting the classifier to "unpruned", setting the minimum number of instances per leaf to 1, setting the confidence factor to 1, adding a dummy attribute with the instance ID number - none of this helped.
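
For reference, here's a minimal, untested sketch of those settings through Weka's Java API (the file name and data loading are assumptions, not my actual setup):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OverfitJ48 {
        public static void main(String[] args) throws Exception {
            // "data.arff" is a placeholder; the last attribute is the class.
            Instances data = DataSource.read("data.arff");
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.setUnpruned(true); // -U: turn off pruning
            tree.setMinNumObj(1);   // -M 1: allow single-instance leaves
            // (The confidence factor only affects pruning, so it is
            // irrelevant once -U is set. Newer Weka versions reportedly
            // also expose tree.setCollapseTree(false) (-O) to stop J48
            // from collapsing subtrees that don't reduce training error.)
            tree.buildClassifier(data);

            // Evaluate on the training set itself to check for overfitting.
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(tree, data);
            System.out.println(eval.toSummaryString());
        }
    }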

I just can't create a model that overfits my data!

I've also tried almost all of the other classifiers Weka provides, but got similar results.

Using IB1 gives 100% accuracy (training set evaluated on the training set), so it's not a problem of multiple instances sharing the same feature values but different classes.

How can I create a completely unpruned tree? Or otherwise force Weka to overfit my data?

Thanks.

Update: Okay, this is absurd. I used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):

J48 unpruned tree
------------------

F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)

Needless to say, IB1 still gives 100% accuracy.

Update 2: I don't know how I missed it - unpruned SimpleCart works and gives 100% accuracy train-on-train; pruned SimpleCart is not as biased as J48 and has decent false positive and false negative ratios.

+2  A: 

The quick and dirty solution is to resample. Throw away all but 1500 of your majority ("0") examples and train on a balanced dataset. I'm pretty sure Weka has a resample filter to do this.
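
If it helps, here's a rough sketch of doing that through Weka's Java API with the SpreadSubsample filter, which randomly discards majority-class instances (the file name is a placeholder):

    import java.util.Arrays;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.SpreadSubsample;

    public class BalanceClasses {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);

            // A distribution spread of 1.0 forces a uniform class
            // distribution by subsampling the majority class.
            SpreadSubsample balance = new SpreadSubsample();
            balance.setDistributionSpread(1.0);
            balance.setInputFormat(data);
            Instances balanced = Filter.useFilter(data, balance);

            // Should print roughly equal counts for "0" and "1".
            int[] counts = balanced.attributeStats(balanced.classIndex()).nominalCounts;
            System.out.println(Arrays.toString(counts));
        }
    }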

The other solution is to use a classifier with a variable cost for each class. I'm pretty sure libSVM allows you to do this, and I know Weka can wrap libSVM. However, I haven't used Weka in a while, so I can't be of much practical help here.
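
For what it's worth, standalone libSVM exposes this as per-class weights on the error penalty C in its svm-train tool; a hedged example (the file names are placeholders, and the 20x weight just mirrors the roughly 20:1 class skew):

    # -wN scales the C parameter for class N; weight the rare "1" class ~20x.
    svm-train -w0 1 -w1 20 train.libsvm model.libsvm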

StompChicken
Thanks. I'm not sure resampling would work - from experiments I've made, it seems that even on a fairly balanced dataset (1000 examples per class), J48 and the other classifiers (except SimpleCart) give ridiculous results: either a very high FP/FN rate for class "0" or a very high one for class "1", while the other class is classified mostly correctly. Regarding cost-sensitive classification - I completely forgot about it; I'll look into it soon. Thank you!
Haggai
The cost-sensitive approach worked. I still don't understand why unpruned J48 won't give me 100% accuracy on the training set, or why J48 still gave ridiculous output even on a fairly balanced dataset, but at least now I have something to work with. Thanks!
Haggai
+2  A: 

Weka contains metaclassifiers: weka.classifiers.meta.CostSensitiveClassifier and weka.classifiers.meta.MetaCost. They let you make any algorithm cost-sensitive (not restricted to SVM) and specify a cost matrix (the penalty for each type of error); here you would give a higher penalty for misclassifying "1" instances as "0" than for erroneously classifying "0" as "1".

The result is that the algorithm then tries to "minimize expected misclassification cost (rather than the most likely class)".
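
An untested sketch of the CostSensitiveClassifier route through the Java API (the file name, the 20.0 penalty, and wrapping J48 are assumptions; the penalty roughly mirrors the 20:1 skew):

    import weka.classifiers.CostMatrix;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.CostSensitiveClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CostSensitiveJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data.arff"); // placeholder
            data.setClassIndex(data.numAttributes() - 1);

            // Rows are the actual class, columns the predicted class
            // (assuming "0" is the first class value). Misclassifying an
            // actual "1" as "0" costs 20; the reverse error costs 1.
            CostMatrix costs = new CostMatrix(2);
            costs.setCell(0, 1, 1.0);
            costs.setCell(1, 0, 20.0);

            CostSensitiveClassifier csc = new CostSensitiveClassifier();
            csc.setClassifier(new J48());
            csc.setCostMatrix(costs);
            // true: predict the class with minimum expected cost;
            // false: reweight the training data by cost instead.
            csc.setMinimizeExpectedCost(true);
            csc.buildClassifier(data);

            Evaluation eval = new Evaluation(data, costs);
            eval.evaluateModel(csc, data);
            System.out.println(eval.toSummaryString());
        }
    }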

Amro
Thanks, that's exactly the solution I've used.
Haggai