Correcting for outliers in a running average

A:

Maybe a good method would be to ignore any results that are more than some defined value outside the current running average?

1800 INFORMATION 2009-04-12 07:26:45

Yes, but how do you say what this "defined value" is?

Edward Z. Yang 2009-04-12 07:30:15

I expect that would come from an examination of the data based on the actual results

1800 INFORMATION 2009-04-12 07:38:14

I really would like to avoid hard-coding something like that in the program

Edward Z. Yang 2009-04-12 07:41:54

It could be a configuration parameter?

1800 INFORMATION 2009-04-12 09:45:46

+2 A:

The definition of what constitutes an abnormal value must scale with the data itself. The classic method is to calculate the z-score of each data point and throw out any value whose z-score is more than 3 in absolute value, i.e. any value more than 3 standard deviations from the average. The z-score is the difference between the data point and the average, divided by the standard deviation.
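As a rough illustration of the z-score rule above, here is a minimal sketch over a batch of values. The function name `zscore_filter` and the choice of the population standard deviation are assumptions made for the example, not something specified in the answer.

```python
import statistics

def zscore_filter(values, threshold=3.0):
    """Return the values whose absolute z-score is at most `threshold`.

    Assumption: mean and standard deviation are computed over the whole
    batch, including any outliers, using the population formula.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mean) / sd <= threshold]
```

Note that the outlier itself inflates the standard deviation, so with very few samples even a wild value may stay under the threshold.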

ojblass 2009-04-12 07:35:41

How well would this method work for the pathological blue line case?

Edward Z. Yang 2009-04-12 07:45:48

The pathological blue line case has a high standard deviation. It would take a significant outlying value to get rejected.

ojblass 2009-04-12 07:56:10

A:

You need to have some idea of the expected variation or distribution if you want to be able to exclude certain (higher) instances of variation as erroneous. For instance, if you can approximate the distribution of the "average times" result by a normal (Gaussian) distribution, then you can do what ojblass suggested and exclude those results that deviate from the mean by more than some multiple of the standard deviation (which can be calculated on the fly alongside your running average). Under a normal distribution about 99.7% of genuine results lie within 3 standard deviations of the mean, so if you want that level of certainty, exclude results that deviate by more than 3 standard deviations. If you only want about 95% certainty, exclude those that deviate by more than 2 standard deviations, and so on.
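One possible way to keep the standard deviation "on the fly" is Welford's online algorithm, sketched below. The answer does not name a specific method, so this is a swapped-in technique; the names `RunningStats` and `normal_coverage` are hypothetical. The second function shows where the 95% and 99.7% figures come from under the normality assumption.

```python
import math

class RunningStats:
    """Running mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        # The sample standard deviation is undefined for fewer than 2 points.
        if self.n < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.n - 1))

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))
```

Each new sample costs constant time and memory, which is why this form suits a running average.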

I'm sure a little bit of googling for "standard deviation" or "gaussian distribution" will help you. Of course, this assumes that you expect a normal distribution of results. You might not. In which case, the first step would be to guess at what distribution you expect.

ozan 2009-04-12 07:53:27

+2 A:

If that example graph you have is typical, then any of the criteria you list will work. Most of those statistical methods are for riding the edge of errors, right at the fuzzy level of "is this really an error?" But your problem looks wildly simple: your errors are not just a couple of standard deviations from the norm, they're 20+. This is good news for you.

So, use the simplest heuristic. Always accept the first 5 points or so in order to prevent a startup spike from ruining your computation. Maintain mean and standard deviation. If your data point falls 5 standard deviations outside the norm, then discard it and repeat the previous data point as a filler.
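The heuristic above could be sketched roughly as follows. The function name `filter_stream` and its parameters are hypothetical, and the guard for a zero standard deviation is an added assumption so that a perfectly flat warm-up does not reject every later value.

```python
import math

def filter_stream(samples, warmup=5, k=5.0):
    """Replace points more than k standard deviations from the running
    mean with the previous output value; always accept the first
    `warmup` points."""
    if warmup < 2:
        raise ValueError("warmup must be at least 2 to estimate a deviation")
    n, mean, m2 = 0, 0.0, 0.0
    out = []
    for x in samples:
        if n >= warmup:
            sd = math.sqrt(m2 / (n - 1))
            if sd > 0 and abs(x - mean) > k * sd:
                out.append(out[-1])  # repeat the previous point as filler
                continue             # rejected points do not update the stats
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        out.append(x)
    return out
```

Because rejected points never feed back into the mean or deviation, a single spike cannot widen the acceptance band.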

If you know your typical data behavior in advance, you may not even need to compute the mean and standard deviation; you can hardwire absolute "reject" limits. This is actually better, in that an initial error won't blow up your detector.

SPWorley 2009-04-12 07:57:36

Excellent! One thing though; although we do know the typical behavior, it is possible that the hardware will get switched out, so I feel that an adaptable program would be preferred.

Edward Z. Yang 2009-04-12 08:04:14

Even if it does need to be adaptable, put some EXTREME limits in there anyway. If a hardware glitch gives you 1e280 as a value, or NaN or +Inf, you may want to filter those out regardless.
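A minimal sketch of such a sanity check is below. The function name `is_plausible` and the default bounds are placeholders chosen for illustration; real limits would depend on the hardware and units involved.

```python
import math

def is_plausible(x, lower=0.0, upper=1e6):
    """Reject NaN, +/-Inf, and anything outside generous absolute limits.

    The default bounds are hypothetical and should be replaced with
    values that make sense for the measurement being taken.
    """
    return math.isfinite(x) and lower <= x <= upper
```

This check would run before any adaptive filter, so a garbage reading never reaches the running statistics.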

SPWorley 2009-04-12 08:22:16

A:

The naive (and possibly best) answer to the bootstrapping question is "Accept the first N values without filtering." Choose N to be as large as possible while still allowing the setup time to be "short" in your application. In this case, you might consider using the window width (64 samples) for N.

Then I would go with some kind of mean and sigma based filter.
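One way this might look, assuming a sliding window that doubles as the warm-up period: nothing is filtered until the window is full, after which each new value is compared against the window's mean and sigma. The class name `WindowedFilter` and the threshold `k=3.0` are assumptions, since the answer does not give a multiplier.

```python
from collections import deque
import statistics

class WindowedFilter:
    """Accept the first `window` values unfiltered, then apply a
    mean-and-sigma test against the most recent accepted values."""

    def __init__(self, window=64, k=3.0):
        if window < 2:
            raise ValueError("window must hold at least 2 samples")
        self.window = deque(maxlen=window)
        self.k = k

    def accept(self, x):
        """Return True and record x if it passes, else return False."""
        if len(self.window) == self.window.maxlen:
            mean = statistics.fmean(self.window)
            sd = statistics.stdev(self.window)
            if sd > 0 and abs(x - mean) > self.k * sd:
                return False
        self.window.append(x)
        return True
```

Recomputing the statistics on every call is fine for a 64-sample window; a larger window might call for an incremental update instead.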

dmckee 2009-04-12 15:22:27

A:

I would compute a running median (a robust alternative to the mean) and a running MAD, the median absolute deviation (a robust alternative to the standard deviation), and remove everything that is more than 5 MADs away from the median. See http://epp.eurostat.ec.europa.eu/portal/page/portal/research_methodology/documents/S4P1_MIRROROUTLIERDETECTION_LIAPIS.pdf
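A rough sketch of the median/MAD idea, assuming the "running" statistics are taken over a sliding window of recently accepted values. The function name `mad_filter`, the window size, and the `min_samples` warm-up are hypothetical choices for the example.

```python
from collections import deque
import statistics

def mad_filter(samples, window=64, k=5.0, min_samples=5):
    """Drop points more than k MADs from the median of recent values."""
    recent = deque(maxlen=window)
    kept = []
    for x in samples:
        if len(recent) >= min_samples:
            med = statistics.median(recent)
            # MAD: median of absolute deviations from the median.
            mad = statistics.median([abs(v - med) for v in recent])
            if mad > 0 and abs(x - med) > k * mad:
                continue  # outlier: skip it and leave the window untouched
        recent.append(x)
        kept.append(x)
    return kept
```

Unlike the mean and standard deviation, the median and MAD barely move if an outlier does slip into the window, which is the robustness the answer refers to.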

2009-06-24 12:16:27
