views: 113
answers: 4

I need to evaluate the effectiveness of algorithms which predict the probability of something occurring.

My current approach is to use "root mean squared error", i.e. the square root of the mean of the squared errors, where the error is 1.0 - prediction if the event occurred, or prediction if the event did not occur.
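For concreteness, here is a minimal Python sketch of how I compute that score (the names predictions and outcomes are just placeholders for my data):

    import math

    def rmse(predictions, outcomes):
        # predictions: predicted probabilities in [0, 1]
        # outcomes: 1 if the event occurred, 0 if it did not
        errors = [(1.0 - p) if y == 1 else p for p, y in zip(predictions, outcomes)]
        return math.sqrt(sum(e * e for e in errors) / len(errors))

    print(rmse([0.9, 0.2, 0.6], [1, 0, 0]))  # e.g. three predictions vs. what happened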

The algorithms have no single application, but a common one will be to predict the probability of the event occurring for each of a variety of options, and then to select the option with the highest predicted probability. The benefit to us is directly proportional to the rate at which the desired event actually occurs among the options with the highest predicted probabilities.
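To make "benefit" concrete, one rough way to score that usage (not the RMSE above, just an illustration) would be the event rate among the top-ranked options; a Python sketch with made-up data:

    def hit_rate_at_top(predictions, outcomes, k):
        # Fraction of options that actually produced the event among the k
        # options with the highest predicted probabilities.
        ranked = sorted(zip(predictions, outcomes), key=lambda pair: pair[0], reverse=True)
        return sum(y for _, y in ranked[:k]) / k

    print(hit_rate_at_top([0.8, 0.3, 0.9, 0.1], [1, 0, 0, 0], k=2))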

It has been suggested that RMSE may not be the best option for this, and I'm interested in the opinions of others.

A: 

I am not sure that I understand your question, so this answer may not be useful to you.

Question:
How can an algorithm that calculates the probability of a system being in a given state be tested against the actual probability?

Presumably this is a system where there are one or more probabilistic initial states that interact to produce a final state, and the distribution of the initial states is known.

This is a question that often comes up when trying to estimate the error in scientific calculations.

Answer:
One way to approach this problem is to use what is called a Monte Carlo simulation.

To do this, you choose a large number of initial states distributed according to the initial probability distributions. For each initial state you calculate the final state of the system. By looking at the distribution of final states you can determine the probability of the final state having a particular value.

You can now compare the results of the simulation with the results of your algorithm.

While the above description might sound technical, such simulations are rather easy to write in practice. You can find a number of tutorials on the web, though most use a Monte Carlo technique for a slightly different problem.

Example:

Suppose you are looking at a system where a number of coins are tossed. You want to know the probability of two of the tossed coins ending up heads.

You could write an algorithm that would predict this probability, with the options being the number of coins tossed. (There is of course an exact calculation for this probability.)

To do the simulation you first set up a large number of initial states. In each initial state you randomly choose, for each coin, whether it is heads or tails up. You then count the number of results where two of the coins are heads up and compare the resulting fraction with your prediction.
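A minimal Python sketch of that simulation (the coin count and trial count are arbitrary choices, and I am reading "two of the coins" as exactly two):

    import random

    def simulate(num_coins=5, num_trials=100000):
        # Monte Carlo estimate of the probability that exactly two coins land heads.
        hits = 0
        for _ in range(num_trials):
            heads = sum(random.random() < 0.5 for _ in range(num_coins))
            if heads == 2:
                hits += 1
        return hits / num_trials

    print(simulate())  # compare this with the algorithm's predicted probability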

amicitas
The issue is that I have a bunch of data where I know what happened, and I know what I predicted would happen. How do I turn this into some kind of "score" that can be used to evaluate algorithms relative to each other?
sanity
Ah. The standard metric that I use is the sum of the squared errors. This is similar to root mean squared error but simpler to interpret. I think this is generally a good metric to use, and it allows you to weight the errors when appropriate. Do you have a concrete example?
amicitas
A: 

A chi-square test is a widely used goodness-of-fit test:

∑ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency of outcome i and Eᵢ is the expected frequency. The chi-square test requires a minimum sample size (roughly 5 or 10, depending on the distribution, particularly its degrees of freedom) for each possible outcome. If the sample size requirement isn't met, you need to apply Yates' correction:

∑ (|Oᵢ − Eᵢ| − 0.5)² / Eᵢ
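A small Python sketch of both statistics, with made-up observed/expected counts (for the uncorrected version you could also just call scipy.stats.chisquare):

    def chi_square(observed, expected, yates=False):
        # Chi-square goodness-of-fit statistic, optionally with Yates' correction.
        total = 0.0
        for o, e in zip(observed, expected):
            diff = abs(o - e) - 0.5 if yates else o - e
            total += diff * diff / e
        return total

    observed = [48, 35, 17]   # observed outcome counts (illustrative)
    expected = [50, 30, 20]   # counts expected from the predicted probabilities
    print(chi_square(observed, expected))
    print(chi_square(observed, expected, yates=True))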

Disclaimer: I'm not a statistician. The above probably misses some of the finer points. I know there's a good reason to use chi-square over RMSE, but I can't remember what it is.

Look for webpages that discuss hypothesis testing.

outis
A: 

Have a look at ROC curves, a.k.a. receiver operating characteristic curves.

To quote from the Wikipedia page:

In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity vs. (1 − specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting the fraction of true positives (TPR = true positive rate) vs. the fraction of false positives (FPR = false positive rate). Also known as a Relative Operating Characteristic curve, because it is a comparison of two operating characteristics (TPR & FPR) as the criterion changes.[1]

ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making. The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battle fields, also known as the signal detection theory, and was soon introduced in psychology to account for perceptual detection of signals. ROC analysis since then has been used in medicine, radiology, and other areas for many decades, and it has been introduced relatively recently in other areas like machine learning and data mining.

It is actually easier than it sounds and makes comparisons easy -- 'better' methods will visually dominate the ROC curve of an inferior method.

R has a number of packages for this.
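In Python, a comparable minimal sketch would use scikit-learn (assuming it is available; the data below are made up):

    from sklearn.metrics import roc_auc_score, roc_curve

    outcomes    = [1, 0, 1, 1, 0, 0, 1, 0]                   # what actually happened
    predictions = [0.9, 0.4, 0.7, 0.6, 0.2, 0.5, 0.8, 0.3]   # predicted probabilities

    fpr, tpr, thresholds = roc_curve(outcomes, predictions)
    print("AUC:", roc_auc_score(outcomes, predictions))  # area under the ROC curve
    # Plotting fpr against tpr (e.g. with matplotlib) gives the ROC curve itself.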

Dirk Eddelbuettel
A: 

It sounds like you're predicting the outcome of something that takes on the value of 0 or 1, right? If so, you might look into a discussion of discrete choice modeling. The word "choice" shouldn't be taken too literally. While most discrete choice models are designed around explaining choices people make every day - buy this product or that, take the train or drive, take one route to work or another - the same models have been applied successfully to dog racing and horse racing.

The key texts on the subject are by Ben-Akiva & Lerman and Kenneth Train. Also look for "Logit models" for info on specifying and fitting these statistical models.
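As a rough illustration of fitting such a model, here is a sketch of a binary logit in Python via statsmodels, with made-up data:

    import numpy as np
    import statsmodels.api as sm

    # Made-up data: one explanatory variable and a 0/1 outcome per observation.
    x = np.array([0.2, 0.5, 0.9, 1.3, 1.8, 2.1, 2.5, 3.0])
    y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

    X = sm.add_constant(x)          # add an intercept term
    model = sm.Logit(y, X).fit()    # maximum-likelihood fit of the logit model
    print(model.summary())
    print(model.predict(X))         # fitted probabilities of the event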

Grembo