views: 325
answers: 3
I'm evaluating a number of different algorithms whose job is to predict the probability of an event occurring.

I am testing the algorithms on large-ish datasets. I measure their effectiveness using "Root Mean Squared Error" (RMSE), which is the square root of the mean of the squared errors. The error is the difference between the predicted probability (a floating point value between 0 and 1) and the actual outcome (either 0.0 or 1.0).

So I know the RMSE, and also the number of samples that the algorithm was tested on.

The problem is that sometimes the RMSE values are quite close to each other, and I need a way to determine whether the difference between them is just chance or represents an actual difference in performance.

Ideally, for a given pair of RMSE values, I'd like to know what the probability is that one is really better than the other, so that I can use this probability as a threshold of significance.

+4  A: 

The MSE is an average and hence the central limit theorem applies. So testing whether two MSEs are the same is the same as testing whether two means are equal. A difficulty compared to a standard test comparing two means is that your samples are correlated -- both come from the same events. But a difference in MSE is the same as a mean of differenced squared errors (means are linear). This suggests calculating a one-sample t-test as follows:

  1. For each observation x, compute the errors e1 and e2 for procedures 1 and 2.
  2. Compute differences of squared errors (e2^2-e1^2).
  3. Compute the mean of the differences.
  4. Compute the standard deviation of the differences.
  5. Compute a t-statistic as mean/(sd/sqrt(n)).
  6. Compare your t-statistic to a critical value or compute a p-value. For instance, reject equality at the 5% significance level if |t|>1.96.
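
For concreteness, here is a minimal sketch of steps 1-6 in Python (assuming NumPy is available; the arrays y_true, p1, p2 are hypothetical stand-ins for your outcomes and the two procedures' predictions):

    import numpy as np

    # Hypothetical data: actual outcomes (0.0/1.0) and predicted probabilities
    # from two procedures, all of the same length n.
    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0], dtype=float)
    p1 = np.array([0.9, 0.2, 0.6, 0.7, 0.4, 0.8, 0.1, 0.3])
    p2 = np.array([0.8, 0.3, 0.5, 0.9, 0.2, 0.6, 0.2, 0.4])

    e1 = p1 - y_true                       # step 1: errors for each procedure
    e2 = p2 - y_true
    d = e2**2 - e1**2                      # step 2: differences of squared errors

    n = len(d)
    mean_d = d.mean()                      # step 3
    sd_d = d.std(ddof=1)                   # step 4: sample standard deviation
    t_stat = mean_d / (sd_d / np.sqrt(n))  # step 5

    # step 6: compare to the critical value (or use scipy.stats.ttest_1samp(d, 0.0)
    # to get an exact p-value).
    print(t_stat, abs(t_stat) > 1.96)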

The RMSE is a monotonic transformation of the MSE, so this test shouldn't give substantively different results. But be careful not to assume that the mean of the root squared errors (MRSE) is the same as the root of the mean squared errors (RMSE); the order of the averaging and the square root matters.

A bigger concern should be overfitting. Make sure to compute all your MSE statistics using data that you did not use to estimate your model.

Tristan
It's a little awkward to total the differences of squared errors, as I actually need to test more than 2 algorithms at once. Is there a way to do this where the inputs are the number of tests (n) and the two RMSE values?
sanity
The advantage of differencing first is that you avoid the need to compute covariance terms. The problem is that cov(e1,e2)!=0, so a standard t-test won't work. If you just have two RMSE values you don't know the covariance, so you won't be able to compute any test statistic easily. If you can run these procedures many times, you might want to consider bootstrapping. That will allow you to easily use any statistic you want in a statistically defensible but simple fashion.
Tristan
What do you mean by "bootstrapping"? I guess I didn't really state the problem very well. Basically I can have hundreds of algorithms and I need to identify the best, so I don't know which algorithms must be compared to each other until I've already run the tests and recorded the results :-/
sanity
If you are interested in the properties (e.g., variance) of any statistic (e.g., difference in RMSE), you can resample your data many times, computing the statistic for each sample. The distribution of these statistics approximates the sampling distribution of your statistic. It is a very powerful and simple idea for frequentist inference: http://bit.ly/3AVsZR
Tristan
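
As an illustration of the bootstrap idea (a minimal sketch, assuming NumPy; the names y_true, preds_a, preds_b and the simulated data are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)

    def rmse(p, y):
        return np.sqrt(np.mean((p - y) ** 2))

    # Hypothetical outcomes and predictions from two algorithms.
    y_true = rng.integers(0, 2, size=1000).astype(float)
    preds_a = np.clip(y_true * 0.7 + rng.normal(0.15, 0.2, 1000), 0, 1)
    preds_b = np.clip(y_true * 0.6 + rng.normal(0.20, 0.2, 1000), 0, 1)

    n = len(y_true)
    diffs = []
    for _ in range(2000):                    # 2000 bootstrap resamples
        idx = rng.integers(0, n, size=n)     # sample indices with replacement
        diffs.append(rmse(preds_a[idx], y_true[idx])
                     - rmse(preds_b[idx], y_true[idx]))

    # 95% bootstrap interval for the RMSE difference; if it excludes zero,
    # chance alone is an unlikely explanation for the gap.
    print(np.percentile(diffs, [2.5, 97.5]))
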
+4  A: 

You are entering into a vast and contentious area of not only computation but philosophy. Significance tests and model selection are subjects of intense disagreement between the Bayesians and the Frequentists. Tristan's comment about splitting the dataset into training and verification sets would not please a Bayesian.

May I suggest that RMSE is not an appropriate score for probabilities? If the samples are independent, the proper score is the sum of the logarithms of the probabilities assigned to the actual outcomes. (If they are not independent, you have a mess on your hands.) What I am describing is scoring a "plug-in" model. Proper Bayesian modeling requires integrating over the model parameters, which is computationally extremely difficult. A Bayesian way to regularize a plug-in model is to add a penalty to the score for unlikely (large) model parameters. That's been called "weight decay."

I got started on my path of discovery reading Neural Networks for Pattern Recognition by Christopher Bishop. I used it and Practical Optimization by Gill et al. to write software that has worked very well for me.

Jive Dadson
Commenting on my own comment: "the proper score" is too strong a term. That score is often appropriate, but you may need a cost score that takes into account the gain from proper classification and the loss from false classifications. It's a deep subject. Don't depend on a model for important decisions unless you REALLY know what you are doing, and you REALLY understand how your data set relates to the distribution of the data you will be evaluating. Dismounting the soapbox now.
Jive Dadson
I'm fully Bayesian. But given the question, a frequentist t-test using out-of-sample errors is certainly statistically defensible and should give good answers. It's not clear the Bayesian solution is at all feasible; the questioner is using existing, non-Bayesian estimation procedures and seems to care about MSE, not a model-fit criterion.
Tristan
@Tristan. Understood. But is MSE reasonable for probability outputs? Is there an analog of Student's t for a logarithmic error score?
Jive Dadson
@Jive. I'm not exactly sure what you're recommending, although I'm sympathetic to your general point. I assume it relates to the likelihood and something like BIC. But you're going to need a degrees of freedom correction, otherwise the best model would just return exactly the data. BIC-type measures have major issues too.
Tristan
@Tristan. He said, "The error is the difference between the predicted probability (a floating point value between 0 and 1) and the actual outcome (either 0.0 or 1.0)." Problem is, that is not a reasonable error score for probabilities. (The outcomes are yes/no events, not measurements with gaussian noise.) Making all the usual assumptions, the error is -sum log(Pi), where the Pi's are the probabilities assigned to the events that actually occurred. I don't think that's insignificant.
Jive Dadson
[Pardon me. I am fighting with the software.] @Tristan. BIC is silly. It falls apart entirely as the number of model parameters grows. But I'm not even talking about model selection yet. I'm only taking issue with the numbers that he intends to use in the selection process.
Jive Dadson
(Continued) He said, "The error is the difference between the predicted probability (a floating point value between 0 and 1) and the actual outcome (either 0.0 or 1.0)." Problem is, that is not a reasonable error score for probabilities. (The outcomes are yes/no events, not measurements with Gaussian noise.) Making all the usual assumptions, the error score is -sum log(Pi), where the Pi's are the probabilities assigned to the events that actually occurred. I don't think that's insignificant.
Jive Dadson
(Continued again) Using squared differences, four predictions of 50% probability would have the same penalty as one prediction of ZERO probability that came to pass. The first result might be quite good. The second one should disqualify the model absolutely (and result in a floating point error. :-))
Jive Dadson
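
To make the arithmetic in that comparison concrete (a hypothetical illustration, not part of the original discussion):

    import math

    # Four 50% predictions, each off by 0.5, versus one prediction of 0%
    # for an event that actually happened (off by 1.0).
    sq_four_halves = 4 * 0.5 ** 2           # = 1.0
    sq_one_zero = 1 * 1.0 ** 2              # = 1.0 -> same squared-error penalty
    log_four_halves = 4 * -math.log(0.5)    # ~= 2.77 under the log score
    # math.log(0.0) raises an error; in the limit the log penalty is infinite,
    # so the log score rejects the zero-probability model outright.
    print(sq_four_halves, sq_one_zero, log_four_halves)
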
I agree that a better measure would be to make the prediction binary and attach utilities to each outcome. But MSE is not automatically invalid because it is continuous. A finance application might be selling a contract for a 0/1 bet. If pi is your estimate of a fair bet, you would sell the contract for pi and earn x - pi (not 0 or 1). Add squared utility over income and you have MSE.
Tristan
x-pi, I understand. What I do not understand is why one would square it. What justification is there for that? If the finance application's goal is to maximize portfolio growth by betting FRACTIONS of the portfolio's capital, then it wants to maximize the LOGARITHM (Kelly Criterion), not the square. What's with the squaring? :-)
Jive Dadson
Yes, that last part of what I said doesn't make sense. Ultimately MSE is a contract that has squared losses over the difference x-pi in either direction. That's a possible utility function, but a strange one that doesn't fit the example I gave.
Tristan
@Jive, I'm very interested in your comment that RMSE is not a good way to measure scores. I ended up using it simply for lack of anything else. Bearing in mind that math is not my strong suit, can you give me a specific pointer as to the justification for this? I've looked at the books you recommended, but I fear that both may be beyond my understanding of mathematics (basically high school), and they are both very expensive :-(
sanity
Just to chime in on the debate about the best metric. The actual application is that I want to select which ad to show, based on the probability my algorithm thinks the ad will be clicked. Thus, the benefit is proportional to the difference in performance between the best ad as selected by algorithm A and the best ad as selected by algorithm B. One metric I've considered is taking, say, the top 10% of predicted probabilities for each algorithm, and then counting how many ads were clicked in this percentile for each algorithm. Whichever one got the most clicks among its top-predicted probabilities wins.
sanity
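
A rough sketch of that top-10% comparison (assuming NumPy; the function name clicks_in_top_decile and its inputs are hypothetical):

    import numpy as np

    def clicks_in_top_decile(pred_probs, clicked):
        # Count clicks among the ads an algorithm ranks in its top 10%.
        pred_probs = np.asarray(pred_probs, dtype=float)
        clicked = np.asarray(clicked)
        k = max(1, int(0.10 * len(pred_probs)))
        top_idx = np.argsort(pred_probs)[-k:]   # indices of the highest predictions
        return int(clicked[top_idx].sum())

    # Whichever algorithm scores more clicks within its own top decile wins:
    # clicks_in_top_decile(preds_a, clicked) vs clicks_in_top_decile(preds_b, clicked)
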
I should add that the advertising application is just one of several applications for the stuff I'm working on; I realize that eventually I'll need to provide pluggable scoring algorithms (and I already have the infrastructure for this).
sanity
A: 

I am responding here to questions in the comments. The subject is far too big to handle in comments.

Cliff Notes version.

The types of scores we are talking about measure probabilities. (Whether that is appropriate for what you are doing is another question.) If you assume that the samples are independent, you get the "total" probability by simply multiplying all the probabilities together. But that usually results in absurdly small numbers, so equivalently, you add the logarithms of the probabilities. Bigger is better. Zero is perfect.

The ubiquitous negative squared error, -x^2, where x is the model's error, comes from the (frequently unjustified) assumption that the training data comprise observations (measurements) corrupted with "Gaussian noise." If you look in Wikipedia or something at the definition of a Gaussian (aka normal) distribution, you'll find that it contains a term of the form e^(-x^2). Take the natural logarithm of that, and voila!, -x^2. But your models do not produce most-likely "pre-noise" values for measurements. They produce probabilities directly. So the thing to do is simply to add the logarithms of the probabilities assigned to the observed events. Those observations are assumed to be noise-free. If the training data says it happened, it happened.
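
For concreteness, the log score described above could be computed along these lines (a sketch assuming NumPy; the small epsilon guarding against log(0) is my own addition):

    import numpy as np

    def log_score(pred_probs, outcomes, eps=1e-12):
        # Sum of the log-probabilities assigned to the events that actually
        # occurred. Bigger is better; zero is perfect.
        p = np.clip(np.asarray(pred_probs, dtype=float), eps, 1 - eps)
        y = np.asarray(outcomes, dtype=float)
        return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

    # Example: log_score([0.9, 0.2, 0.7], [1.0, 0.0, 1.0]) is about -0.69.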

Your original question remains unanswered. How to tell if two models differ "significantly"? That is a vague and difficult question. It is the subject of much debate and even emotion and rancor. It's also not really the question you want answered. What you want to know is which model gives you the best expected profit, all things considered, including how much each software package costs, etc.

I'll have to break this off soon. This is not the place for a course on modeling and probability, and I am not really qualified as the professor.

Jive Dadson