I have several series of data points that need to be graphed. For each graph, some points may need to be thrown out due to error. An example is the following:

[image: example series with the erroneous regions circled]

The circled areas are errors in the data.

What I need is an algorithm to filter this data so that it eliminates the error by replacing the bad points with flat lines, like so:

[image: the same series with the error points replaced by flat lines]

Are there any algorithms out there that are especially good at detecting error points? Do you have any tips that could point me in the right direction?

EDIT: Error points are any points that don't look consistent with the data on both sides. There can be large jumps, as long as the data after the jump still looks consistent. A large jump at the edge of the graph, though, should probably be considered an error.

A: 

This is a problem that is hard to solve generically; your final solution will end up being very process-dependent, and unique to your situation.

That being said, you need to start by understanding your data: from one sample to the next, what kind of variation is possible? With that knowledge, you can use previous samples (and maybe future samples) to decide whether the current sample is bogus. You'll end up with a filter that looks something like this:

    // using System; using System.Collections.Generic;

    const int MaxQueueLength = 100;       // adjust these two values as necessary
    const double MaxProjectionError = 5;  // max allowed deviation from the projection

    List<double> FilterData(List<double> rawData)
    {
        List<double> toRet = new List<double>(rawData.Count);
        Queue<double> history = new Queue<double>(MaxQueueLength);
        foreach (double rawSample in rawData)
        {
            // keep the history window bounded
            while (history.Count >= MaxQueueLength)
                history.Dequeue();
            double projectedSample = GuessNext(history, rawSample);
            // substitute the projection when the raw sample strays too far from it
            double currentSample = Math.Abs(projectedSample - rawSample) > MaxProjectionError
                ? projectedSample
                : rawSample;
            toRet.Add(currentSample);
            history.Enqueue(currentSample);
        }
        return toRet;
    }

The magic, then, is coming up with your GuessNext function. Here, you'll be getting into stuff that is specific to your situation, and should take into account everything you know about the process that is gathering data. Are there physical limits to how quickly the input can change? Does your data have known bad values you can easily filter?

Here is a simple example of a GuessNext function that works off the first derivative of your data (i.e. it assumes your data is roughly a straight line when you look at only a small section of it):

    double GuessNext(Queue<double> history, double nextSample)
    {
        // Simple first-derivative projection: assume the input approximates a
        // straight line over any small section, and extrapolate from the last
        // two accepted samples in the history.
        if (history.Count == 0)
            return nextSample;              // nothing to project from yet
        double[] past = history.ToArray();
        double last = past[past.Length - 1];
        if (past.Length == 1)
            return last;                    // no slope information yet
        double secondLast = past[past.Length - 2];
        return last + (last - secondLast);
    }
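
For illustration, here is the filter run over a made-up ramp with a single spike (the numbers are invented; MaxProjectionError = 5 as above):

    // a straight ramp with one bogus spike at index 3
    var raw = new List<double> { 1.0, 2.0, 3.0, 40.0, 5.0, 6.0 };
    List<double> filtered = FilterData(raw);
    // the spike projects to 4.0 from the preceding samples, and |4 - 40| > 5,
    // so the output is { 1, 2, 3, 4, 5, 6 }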

If your data is particularly noisy, you may want to apply a smoothing filter to it before you pass it to GuessNext. You'll just have to spend some time with the algorithm to come up with something that makes sense for your data.
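A minimal sketch of such a smoothing pass, assuming a centered moving average is appropriate (the window size of 5 is arbitrary):

    // centered moving average; the window shrinks at the edges of the series
    List<double> Smooth(List<double> data, int window = 5)
    {
        var smoothed = new List<double>(data.Count);
        int half = window / 2;
        for (int i = 0; i < data.Count; i++)
        {
            int from = Math.Max(0, i - half);
            int to = Math.Min(data.Count - 1, i + half);
            double sum = 0;
            for (int j = from; j <= to; j++)
                sum += data[j];
            smoothed.Add(sum / (to - from + 1));
        }
        return smoothed;
    }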

Your example data appears to be parametric, in that each sample defines both an X and a Y value. You might be able to apply the above logic to each dimension independently, which would be appropriate if only one dimension is the one giving you bad numbers. This can be particularly successful in cases where one dimension is a timestamp, for instance, and the timestamp is occasionally bogus.
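For instance, assuming the series arrives as two parallel lists (the variable names are invented for the sketch):

    // treat each dimension of the parametric series as its own 1-D stream
    List<double> filteredX = FilterData(rawXs);   // e.g. timestamps
    List<double> filteredY = FilterData(rawYs);   // e.g. readings

Since MaxProjectionError is a shared constant above, both dimensions would use the same threshold; in practice you would probably pass the threshold in as a parameter so X and Y can be tuned separately.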

Drew Shafer
A: 

If removing the outliers by eye is not possible, try kriging (with error terms), as in http://www.ipf.tuwien.ac.at/cb/publications/pipeline.pdf . This seems to work quite well for automatically dealing with occasional extreme noise. I know that French meteorologists use such an approach to remove outliers in their data (caused by a fire next to a temperature sensor, or something kicking a wind sensor, for instance).
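This is not the method from the paper, but a toy sketch of the core idea: a nugget term models measurement error, and a point is flagged when its leave-one-out kriging residual is implausibly large. The exponential covariance model, the parameter values, and the function names below are all invented for illustration:

    // Toy 1-D kriging outlier check: predict sample i from its two neighbours
    // using an exponential covariance with a nugget (error) term, then flag
    // the sample if its residual exceeds a few kriging standard deviations.
    static double Cov(double h, double sill, double range)
    {
        return sill * Math.Exp(-Math.Abs(h) / range);
    }

    static bool LooksLikeOutlier(double[] t, double[] y, int i,
        double sill = 1.0, double range = 5.0, double nugget = 0.1, double nSigma = 3.0)
    {
        if (i == 0 || i == y.Length - 1)
            return false;                       // this toy version skips the edges
        int a = i - 1, b = i + 1;
        double mean = (y[a] + y[b]) / 2.0;      // crude local mean

        // covariance among the two neighbours; the nugget is the error term
        double c11 = sill + nugget;
        double c12 = Cov(t[a] - t[b], sill, range);
        // covariances between each neighbour and the point being checked
        double k1 = Cov(t[a] - t[i], sill, range);
        double k2 = Cov(t[b] - t[i], sill, range);

        // solve the 2x2 kriging system [c11 c12; c12 c11] * w = [k1; k2]
        double det = c11 * c11 - c12 * c12;
        double w1 = (c11 * k1 - c12 * k2) / det;
        double w2 = (c11 * k2 - c12 * k1) / det;

        double predicted = mean + w1 * (y[a] - mean) + w2 * (y[b] - mean);
        double variance = sill + nugget - (w1 * k1 + w2 * k2);  // kriging variance

        return Math.Abs(y[i] - predicted) > nSigma * Math.Sqrt(variance);
    }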

Please note that this is a difficult problem in general, and any information about the errors is precious. Did someone kick the measuring device? Then you cannot do much except remove the offending data by hand. Is your noise systematic? Then you can do a lot by making (reasonable) hypotheses about it.

Alexandre C.