views: 433

answers: 2
I've got a classification problem at hand, which I'd like to address with a machine learning algorithm (probably Bayesian or Markovian; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier while taking the overfitting problem into account.

That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and use these very same samples to measure fitness, I might run into an overfitting problem: the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.

An obvious solution would be separating the hand-tagged samples into training and test samples, and I'd like to learn about methods for selecting a statistically significant set of samples for training.
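For concreteness, here is a minimal sketch of such a holdout split; the Sample type, the class name, and the split ratio are illustrative placeholders, not part of any particular library:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Shuffle the hand-tagged samples and reserve a fraction of them for testing.
    // "Sample" stands in for whatever type holds one labelled training instance.
    class HoldoutSplit<Sample> {
        final List<Sample> train = new ArrayList<>();
        final List<Sample> test = new ArrayList<>();

        HoldoutSplit(List<Sample> all, double testFraction, long seed) {
            List<Sample> shuffled = new ArrayList<>(all);
            Collections.shuffle(shuffled, new Random(seed));
            int testSize = (int) Math.round(shuffled.size() * testFraction);
            test.addAll(shuffled.subList(0, testSize));
            train.addAll(shuffled.subList(testSize, shuffled.size()));
        }
    }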

White papers, book pointers, and PDFs much appreciated!

+7  A: 

You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating classification algorithm performance.

The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
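A rough sketch of that loop, assuming you already have your own training and scoring code (trainAndEvaluate below is a placeholder, not a real API):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    class KFoldCrossValidation {
        // Returns the performance averaged over k folds.
        static <Sample> double crossValidate(List<Sample> samples, int k, long seed) {
            List<Sample> shuffled = new ArrayList<>(samples);
            Collections.shuffle(shuffled, new Random(seed));
            double total = 0.0;
            for (int fold = 0; fold < k; fold++) {
                List<Sample> train = new ArrayList<>();
                List<Sample> test = new ArrayList<>();
                for (int i = 0; i < shuffled.size(); i++) {
                    // every k-th sample (offset by the fold index) goes to the test set
                    if (i % k == fold) test.add(shuffled.get(i));
                    else train.add(shuffled.get(i));
                }
                total += trainAndEvaluate(train, test);
            }
            return total / k;
        }

        // Placeholder: train your classifier on `train` and return its score on `test`.
        static <Sample> double trainAndEvaluate(List<Sample> train, List<Sample> test) {
            return 0.0;
        }
    }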

Mr. Brownstone
http://en.wikipedia.org/wiki/Root-mean-square_error_of_cross-validation#K-fold_cross-validation (links directly to k-fold cross-validation within the wiki article you linked)
JoeCool
+1  A: 

As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers; for this I used Weka, which has an API and a load of tools that let you easily test the performance of lots of different classifiers.
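To give an idea, a small example using Weka's Evaluation class, which does the 10-fold cross-validation for you (the ARFF filename and the NaiveBayes choice are placeholders; swap in whatever dataset and classifier you're comparing):

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaCrossValidationDemo {
        public static void main(String[] args) throws Exception {
            // Load a dataset; "training-data.arff" is just an example filename.
            Instances data = DataSource.read("training-data.arff");
            data.setClassIndex(data.numAttributes() - 1); // last attribute is the class label

            Classifier classifier = new NaiveBayes(); // any Weka classifier works here

            // 10-fold cross-validation with a fixed random seed.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(classifier, data, 10, new Random(1));

            System.out.println(eval.toSummaryString());
        }
    }

Evaluation also exposes per-class precision/recall and the confusion matrix if you need more than overall accuracy.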

Mark Davidson