views:

326

answers:

2

Dear all,

I am currently strugeling with a machine learning problem whereas I have to deal with great unbalanced data sets. That is, there are six classes ('1','2'...'6'). Unfortunately there are e.g. for class '1' 150 examples/instances, for '2' 90 instances and for class '3' only 20. All other classes can't be "trained" since there are no available instances for these classes.

So far, I figured out that WEKA (the machine learning toolkit I am using) provides this supervised "Resample" filter. When I apply this filter with 'noReplacement'=false and 'bialToUniformClass'=1.0 then this results in a data set, where the the number of instances is nice and almost equal (for class '1'..'3' and the others stay empty).

My question is now: how does WEKA and this filter generate "new"/additional instances for different classes.

Thank you very much in advance for any hints or suggestions.

Cheers Julian

+1  A: 

The answer is pretty surprising:

Using WEKA's supervised Resample filter adds instances to a class. This realized by simply adding instances from the class which has only few instances multiple times to the result data set.

Therefore the resulting data set is strongly biased in terms of a class for which only few samples are available.

Cheers

Julian

Julian
A: 

It doesn't. It's resampling existing instances. If you have one class-2 instance, and ask for a resampling with a bias of 1.0, you can expect N copies of that instance and N other instances of each other type for which there is already data.

James