I created a heuristic (an ANN, but that's not important) to estimate the probabilities of an event (the results of sports games, but that's not important either). Given some inputs, this heuristic tells me the probabilities of the event. Something like: given these inputs, team B has a 65% chance to win.

I have a large set of input data for which I know the result (games previously played). Which formula/metric could I use to qualify the accuracy of my estimator?

The problem I see is: if the estimator says the event has a probability of 20% and the event actually does occur, I have no way to tell if my estimator is right or wrong. Maybe it's wrong and the event was more likely than that. Maybe it's right: the event had about a 20% chance to occur and did occur. Maybe it's wrong and the event had a really low chance to occur, say 1 in 1000, but happened to occur this time.

Fortunately I have lots of this actual test data, so there is probably a way to use it to qualify my heuristic.

anybody got an idea?

+1  A: 

In a way it depends on the decision function you are using.

In the case of a binary classification task (predicting whether an event occurred or not [ex: win]), a simple implementation is to predict 1 if the probability is greater than 50%, 0 otherwise.

If you have a multiclass problem (predicting which one of K events occurred [ex: win/draw/lose]), you can predict the class with the highest probability.

Either way, you evaluate your heuristic by computing the prediction error: compare the actual class of each instance with the class your heuristic predicts for it.
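
For concreteness, a minimal Python sketch of the two decision rules described above; the arrays and sample numbers are made up for illustration:

    import numpy as np

    # Hypothetical data: predicted probabilities from the heuristic and actual outcomes.
    p_win = np.array([0.65, 0.20, 0.80, 0.55])   # P(team B wins) per game
    actual = np.array([1, 0, 1, 0])              # 1 = team B won, 0 = lost

    # Binary rule: predict "win" when the probability exceeds 50%.
    pred_binary = (p_win > 0.5).astype(int)
    print("binary error rate:", np.mean(pred_binary != actual))

    # Multiclass rule (win/draw/lose): pick the class with the highest probability.
    p_classes = np.array([[0.60, 0.25, 0.15],    # one row of [win, draw, lose] per game
                          [0.20, 0.30, 0.50]])
    actual_class = np.array([0, 2])              # 0 = win, 1 = draw, 2 = lose
    pred_class = p_classes.argmax(axis=1)
    print("multiclass error rate:", np.mean(pred_class != actual_class))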

Note that you would usually divide your data into train/test parts to get better (unbiased) estimates of the performance.

Other tools for evaluation exist, such as ROC curves, which are a way to depict performance with regard to true/false positives.

Amro
I'm not creating a binary classification heuristic, nor am I creating a multiclass classifier. The result of my heuristic needs to be a probability, predicting how likely an event is.
Mathieu Pagé
Yes, but you test it against data that has the actual realization of the event (did team B actually win or not), and you want your probabilities to be as close as possible to the actual output (ideally you want won = 100%, lost = 0% for games where team B won).
Amro
Not necessarily. If the underlying process is "truly random", trying to predict the outcome of each event makes no sense. If I gave you an unbalanced coin and asked you to give me the probability that it lands heads, you would measure the actual frequency of heads, and your estimator would be that probability. You would not be able to find a model which predicts the outcome for an individual toss.
Mathias
+1  A: 

You test an estimator by giving it as much test data as you can and seeing how well it predicts. So, as the first answer explains, divide your data into training, validation, and test sets. If you have Model 1, ..., Model N in your hands, you train each model on the training data, then use the validation data to choose the best model, and finally test that best model on the test data. You probably don't need the validation step if you have already settled on a single model (which I think is your case).

Randomly shuffle your data and split it into training/validation/test. Do not mix the sets once you have split them; strictly use each for the purpose it was prepared for (many people don't appreciate how crucial this is!). There are some cases where you use the training data for testing and the other way around, but that mainly applies when you have little data.

There is no hard and fast rule for deciding how much data to use for training versus testing; this is a research area in itself. Two factors, for example, are how much data you have available and how complex your model is.
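
A rough sketch of such a split in Python, assuming the inputs and labels are NumPy arrays; the 60/20/20 fractions and the helper name are arbitrary choices, not a rule:

    import numpy as np

    def split_data(X, y, frac_train=0.6, frac_val=0.2, seed=0):
        """Shuffle once, then split into train/validation/test."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_train = int(frac_train * len(X))
        n_val = int(frac_val * len(X))
        train, val, test = np.split(idx, [n_train, n_train + n_val])
        return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

    # Usage: fit each candidate model on the train split, pick the best one on the
    # validation split, and report its performance on the untouched test split.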

See the following on how to evaluate binary classifiers,

http://en.wikipedia.org/wiki/Binary%5Fclassifier#Evaluation%5Fof%5Fbinary%5Fclassifiers

tilish
+1  A: 

As you stated, if you predict that an event has a 20% chance of happening (and 80% of not happening), observing a single isolated event will not tell you how good or poor your estimator is. However, if you had a large sample of events for which you predicted 20% success but observed that 30% of them succeeded, you could begin to suspect that your estimator is off.
One approach would be to group your events by predicted probability of occurrence, observe the actual frequency within each group, and measure the difference. For instance, depending on how much data you have, group all events where you predict a 20% to 25% chance of occurrence, compute the actual frequency of occurrence in that group, and measure the difference for each group. This should give you a good idea of whether your estimator is biased, and possibly for which ranges it's off.
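
A rough sketch of this grouping idea in Python; the bin count and helper name are my own choices, not something prescribed above:

    import numpy as np

    def calibration_table(p_pred, outcomes, bins=10):
        """Group predictions into probability bins and compare the average
        predicted probability with the observed frequency in each bin."""
        p_pred, outcomes = np.asarray(p_pred), np.asarray(outcomes)
        edges = np.linspace(0.0, 1.0, bins + 1)
        which = np.digitize(p_pred, edges[1:-1])   # bin index for each prediction
        rows = []
        for b in range(bins):
            mask = which == b
            if mask.any():
                rows.append((edges[b], edges[b + 1],
                             p_pred[mask].mean(),    # average predicted probability
                             outcomes[mask].mean(),  # observed frequency of the event
                             mask.sum()))            # number of games in the bin
        return rows

    # A well-calibrated estimator shows observed frequencies close to the
    # predicted probabilities in every bin that has enough games.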

Mathias
This seems like a good idea, and a simple one. I don't know if I have enough data to create enough groups, but if I do, it looks promising.
Mathieu Pagé
+3  A: 

There are a number of measurements that you could use to quantify the performance of a binary classifier.

Do you care whether or not your estimator (ANN, e.g.) outputs a calibrated probability or not?

If not, i.e. all that matters is rank ordering, then maximizing the area under the ROC curve (AUROC) is a pretty good summary of performance. Others are the "KS" statistic and lift. There are many in use, and they emphasize different facets of performance.
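
Since only the rank ordering matters for AUROC, here is a small sketch of its pairwise definition (the probability that a randomly chosen positive outranks a randomly chosen negative); the helper name is made up and the quadratic loop is only meant as a sanity check, not for huge datasets:

    import numpy as np

    def auroc(scores, labels):
        """AUROC as P(score of a random positive > score of a random negative),
        counting ties as 1/2."""
        scores, labels = np.asarray(scores), np.asarray(labels)
        pos, neg = scores[labels == 1], scores[labels == 0]
        greater = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()
        return (greater + 0.5 * ties) / (len(pos) * len(neg))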

If you care about calibrated probabilities, then the most common metrics are "cross entropy" (also known as the Bernoulli likelihood / maximum likelihood, the typical measure used in logistic regression) or the "Brier score". The Brier score is none other than the mean squared error comparing continuous predicted probabilities to binary actual outcomes.
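
A short sketch of both metrics in Python, assuming 0/1 outcomes and predicted probabilities in arrays (the clipping epsilon is an arbitrary safeguard against log(0)):

    import numpy as np

    def brier_score(p_pred, outcomes):
        """Mean squared error between predicted probabilities and 0/1 outcomes."""
        p = np.asarray(p_pred, float)
        y = np.asarray(outcomes, float)
        return np.mean((p - y) ** 2)

    def cross_entropy(p_pred, outcomes, eps=1e-12):
        """Average negative log-likelihood of the observed outcomes."""
        p = np.clip(np.asarray(p_pred, float), eps, 1 - eps)
        y = np.asarray(outcomes, float)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

Lower is better for both; a perfectly calibrated and perfectly sharp estimator would score 0.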

Which is the right thing to use depends on the ultimate application of the classifier. For example, your classifier may estimate probability of blowouts really well, but be substandard on close outcomes.

Usually, the true metric you're trying to optimize is "dollars made". That's often hard to represent mathematically, but starting from it is your best shot at coming up with an appropriate and computationally tractable metric.

Matt Kennel
Thanks a lot Matt, I'll look up the metrics you suggest, but I think you really understood what I'm trying to do: output calibrated probabilities.
Mathieu Pagé
It looks like the Brier score is what I was looking for. Thanks!
Mathieu Pagé