Suppose I'm working on some classification problem. (Fraud detection and comment spam are two problems I'm working on right now, but I'm curious about any classification task in general.)

  1. How do I know which classifier I should use? (Decision tree, SVM, Bayesian, logistic regression, etc.) In which cases is one of them the "natural" first choice, and what are the principles for choosing that one?

Examples of the type of answers I'm looking for (from Manning et al.'s "Introduction to Information Retrieval" book: http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html):

a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes). [I'm guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.]

b. If you have a ton of data, then the classifier doesn't really matter so much, so you should probably just choose a classifier with good scalability.

  2. What are other guidelines? Even answers like "if you'll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though.

  3. Also, for a somewhat separate question, besides standard Bayesian classifiers, are there 'standard state-of-the-art' methods for comment spam detection (as opposed to email spam)?

[Not sure if stackoverflow is the best place to ask this question, since it's more machine learning than actual programming -- if not, any suggestions for where else?]

A: 

My take is that you should always run the basic classifiers first to get a sense of your data. More often than not (in my experience, at least) they've been good enough.

So, if you have labeled data, train a Naive Bayes classifier. If your data is unlabeled, you can try k-means clustering.
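
As a rough illustration of that starting point, here's a minimal sketch assuming scikit-learn; X and y are random placeholders standing in for your real features and labels:

    # Sketch of the "run a basic model first" idea, assuming scikit-learn.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 5)          # placeholder feature matrix
    y = np.random.randint(0, 2, 100)    # placeholder labels (supervised case)

    # Supervised: a Naive Bayes baseline
    nb = GaussianNB().fit(X, y)
    print("Naive Bayes training accuracy:", nb.score(X, y))

    # Unsupervised: k-means clustering (labels ignored)
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print("k-means cluster assignments:", km.labels_[:10])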

Another resource is the Stanford Machine Learning lecture series I watched a while back. In video 4 or 5, I think, he discusses some generally accepted conventions for training classifiers, and their advantages and tradeoffs.

aduric
+2  A: 

Model selection using Cross Validation may be what you need.

http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29

http://en.wikipedia.org/wiki/Model_selection

Cross Validation

You simply split your dataset into K non-overlapping subsets (folds), train a model on K-1 folds, and measure its performance on the fold you left out. You do this for each possible choice of held-out fold (first leave the 1st fold out, then the 2nd, ..., then the Kth, training on the remaining folds each time). When you're done, you estimate the mean performance across the folds (and perhaps also the variance/standard deviation of the performance).

How to choose the parameter K depends on the time you have. Usual values are 3, 5, 10, or even N, where N is the size of your dataset (that's the same as Leave-One-Out Cross Validation). I prefer 5 or 10.
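
For concreteness, here is a minimal K-fold sketch assuming scikit-learn; the classifier and the data are just placeholders:

    # K-fold cross validation sketch, assuming scikit-learn.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import GaussianNB

    X = np.random.rand(200, 5)          # placeholder data
    y = np.random.randint(0, 2, 200)

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
        model = GaussianNB().fit(X[train_idx], y[train_idx])       # train on K-1 folds
        scores.append(model.score(X[test_idx], y[test_idx]))       # score on the held-out fold

    print("mean accuracy:", np.mean(scores), "std:", np.std(scores))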

Model Selection

Let's say you have 5 methods (ANN, SVM, KNN, etc.) and 10 parameter combinations for each method (depending on the method). You simply run Cross Validation for each method and parameter combination (5x10 = 50 runs) and select the best model, method, and parameters. Then you re-train with the best method and parameters on all your data, and you have your final model!
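
A rough sketch of that search loop, assuming scikit-learn; the candidate methods and parameter values below are arbitrary examples, not a recommendation:

    # Model selection sketch: score each (method, parameters) candidate with CV,
    # keep the best, then refit it on all the data. Assumes scikit-learn.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    X = np.random.rand(200, 5)          # placeholder data
    y = np.random.randint(0, 2, 200)

    candidates = [SVC(C=c) for c in (0.1, 1, 10)] + \
                 [KNeighborsClassifier(n_neighbors=k) for k in (3, 5, 7)]

    best = max(candidates,
               key=lambda m: cross_val_score(m, X, y, cv=5).mean())
    final_model = best.fit(X, y)        # re-train the winner on all the data
    print("selected:", final_model)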

Well, there is more to say. If, for example, you try many methods and many parameter combinations for each, it's very likely you will overfit the selection itself. In cases like these you have to use nested Cross Validation.

Nested Cross Validation

In nested Cross Validation you perform Cross Validation on the Model Selection procedure itself. Again, you first split your data into K folds. In each step you use K-1 folds as your training data and the remaining one as your test data. Then you run Model Selection (the procedure I explained above) on each of those K training sets. After finishing you will have K selected models, one per split. You then test each one on its held-out fold to estimate how well the whole selection procedure generalizes. Finally, as before, you train a new model with the winning method and parameters on all the data you have. That's your final model.
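
A compact way to express this, again assuming scikit-learn and an arbitrary SVM parameter grid: the inner loop does the model selection, and the outer loop estimates how well that whole selection procedure generalizes.

    # Nested cross validation sketch, assuming scikit-learn.
    # Inner CV (GridSearchCV) picks parameters; outer CV scores that procedure.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X = np.random.rand(200, 5)          # placeholder data
    y = np.random.randint(0, 2, 200)

    inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)   # model selection
    outer_scores = cross_val_score(inner, X, y, cv=5)        # unbiased estimate
    print("nested CV accuracy:", outer_scores.mean())

    final_model = inner.fit(X, y).best_estimator_            # final model on all data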

Of course there are many variations of these methods, and other things I didn't mention. If you need more information, look for publications on these topics.

George B.
Yep, I know about cross validation -- I was wondering more about a priori reasons to select a certain classifier (and then I could use cross validation to tune some parameters, or to select between some smaller set of classifiers). Thanks, though!
LM
Well, I would use an SVM or a neural network. I would also try variable selection to reduce the number of features. A great algorithm for variable selection is the Max-Min Hill Climbing Bayesian network structure learning algorithm (MMHC). (http://portal.acm.org/citation.cfm?id=1164587)
George B.
A: 

Sam Roweis used to say that you should try naive Bayes, logistic regression, k-nearest neighbour and Fisher's linear discriminant before anything else.
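
A sketch of that kind of quick baseline comparison, assuming scikit-learn (with Fisher's linear discriminant as LinearDiscriminantAnalysis); X and y are placeholders for your real data:

    # Compare simple baseline classifiers via cross validation. Assumes scikit-learn.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X = np.random.rand(200, 5)          # placeholder data
    y = np.random.randint(0, 2, 200)

    for clf in (GaussianNB(), LogisticRegression(max_iter=1000),
                KNeighborsClassifier(), LinearDiscriminantAnalysis()):
        print(type(clf).__name__, cross_val_score(clf, X, y, cv=5).mean())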

bayer