ansaurus

Question

Algorithms to eliminate or display outliers in a dataset.

Answer 1

+4 A:

Outliers are usually removed for analysis (fitting a model), not for visualization, where you can just change your focus so the outliers are out of view.

The plots you mention, parallel coordinate chart and a scatterplot matrix, should be fine since each variable has its own scale. If your software forces all scales to be the same, the usual trick is to "standardize" the values by subtracting the mean and dividing by the standard deviation.

However, doing anything with standard deviation counts as analysis (so you may need to remove outliers after all) and assumes a Normal distribution. So first perform a distribution analysis of each variable -- at least look at a histogram and preferably do some kind of goodness of fit test against a normal curve. If the distribution is more like a Log Normal or something else, you can apply a transform to the data to make it Normal for better visualization (or use a log scale in the visualization).

Removing outliers is a bit of an art and should require knowledge of the data (because often the outliers are the interesting parts). The most common definition of outliers for box plots is points beyond 1.5 times the inter-quartile range from the box quartiles, which is usually farther than two standard deviation.

Another technique is often called "robust" fitting. You fit a Normal curve to the data and remove points from the tails that where the value of the fitted curve is very small. Repeat those two steps until it stabilizes (usually 1 to 3 times).

xan 2009-12-05 23:10:26

Answer 2

A:

Here is how I would go about doing it in SQL Server.

The query below will get the average weight from a fictional Scale table holding a single weigh-in for each person while not permitting those who are overly fat or thin to throw off the more realistic average:

select w.Gender, Avg(w.Weight) as AvgWeight
    from ScaleData w
    join ( select d.Gender, Avg(d.Weight) as AvgWeight, 
                  2*STDDEVP(d.Weight) StdDeviation
             from ScaleData d
            group by d.Gender
         ) d
      on w.Gender = d.Gender
     and w.Weight between d.AvgWeight-d.StdDeviation 
                      and d.AvgWeight+d.StdDeviation
   group by w.Gender

There may be a better way to go about this, but it works and works well. If you have come across another more efficient solution, I’d love to hear about it.

NOTE: the above removes the top and bottom 5% of outliers out of the picture for purpose of the Average. You can adjust the number of outliers removed by adjusting the 2* in the 2*STDDEVP as per: http://en.wikipedia.org/wiki/Standard_deviation

Jason Southwell 2010-02-17 20:39:12

ansaurus

tags:

views:

answers:

Algorithms to eliminate or display outliers in a dataset.

related questions