I'm building a binary classification tree using mutual information gain as the splitting criterion. Because the training data is skewed toward a few classes, I've been advised to weight each training example by its inverse class frequency.
How do I weight the training data? When calculating the probabilities to estimate the entropy, do I take weighted averages?
EDIT: I'd like an expression for entropy with the weights.
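For concreteness, here is a minimal sketch of one common way to write this, not necessarily the only valid one. It assumes each example i carries a weight w_i = 1 / (number of training examples in its class), computed once on the full training set and carried unchanged into every child node. The class probability at a node is then estimated as p_k = (sum of w_i over examples of class k at the node) / (sum of all w_i at the node), and the entropy is H = -sum_k p_k log2 p_k. For the gain, each child is weighted by its share of the total weight rather than its share of the example count. The function names below are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def inverse_frequency_weights(labels):
    """Weight each example by 1 / (size of its class); an assumed weighting scheme."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def weighted_entropy(labels, weights):
    """Entropy in bits, with class probabilities estimated from weight mass."""
    mass = defaultdict(float)
    for y, w in zip(labels, weights):
        mass[y] += w
    total = sum(mass.values())
    if total == 0:
        return 0.0
    # p_k = (weight of class k) / (total weight); skip empty classes.
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)


def weighted_information_gain(labels, weights, goes_left):
    """Gain of a binary split; goes_left[i] is True if example i goes to the left child."""
    left = [(y, w) for y, w, g in zip(labels, weights, goes_left) if g]
    right = [(y, w) for y, w, g in zip(labels, weights, goes_left) if not g]
    total = sum(weights)
    gain = weighted_entropy(labels, weights)
    for child in (left, right):
        if child:
            ys, ws = zip(*child)
            # Each child counts in proportion to its weight mass, not its size.
            gain -= (sum(ws) / total) * weighted_entropy(ys, ws)
    return gain


# Toy data: 8 examples of class "a", 2 of class "b".
labels = ["a"] * 8 + ["b"] * 2
weights = inverse_frequency_weights(labels)
root_entropy = weighted_entropy(labels, weights)
perfect_split_gain = weighted_information_gain(labels, weights, [y == "a" for y in labels])
```

With these weights both classes carry equal total mass at the root, so the skewed 8-to-2 sample is treated as if it were balanced. Setting every weight to 1 recovers the usual unweighted entropy.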