views:

345

answers:

6

I have a set of 200 data rows (so a fairly small dataset). I want to carry out some statistical analysis, but before that I want to exclude outliers. What are the potential algorithms for this purpose? Accuracy is a concern.
I am very new to statistics, so I need help with very basic algorithms.

A: 

Compute the mean and standard deviation of the set, and exclude everything more than one, two, or three standard deviations from the mean.
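A minimal sketch of this approach, assuming a flat list of numeric values (the cutoff `k` is the number of standard deviations you choose to keep):

```python
import statistics

def filter_by_sigma(values, k=2.0):
    """Keep only values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) <= k * sd]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 35.0]  # 35.0 is an obvious outlier
print(filter_by_sigma(data, k=2.0))  # the 35.0 is dropped
```

Note that the mean and standard deviation are computed *including* the outliers, which (as the comments below point out) can weaken the filter.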

Bear
Be aware that (for normally distributed data) roughly 1/3 of the data lies outside of one sigma, and roughly 1/20 outside of two sigma. Setting the limits too tightly will hurt your statistics and may mask systematic effects.
dmckee
-1 because the standard deviation and the mean will be distorted by the presence of outliers.
Kena
@Kena: the poster asked to exclude outliers, and the standard deviation will certainly cause outliers to be excluded. That you perhaps shouldn't exclude outliers at all, or that the standard deviation will initially give an awkward or less-than-ideal cutoff, is largely irrelevant here.
Bear
+1  A: 

You may have heard the expression 'six sigma'.

This refers to plus and minus 3 sigma (ie, standard deviations) around the mean.

Anything outside the 'six sigma' range could be treated as an outlier.

On reflection, I think 'six sigma' is too wide.

This article describes how it amounts to "3.4 defective parts per million opportunities."

It seems like a pretty stringent requirement for certification purposes. Only you can decide if it suits you.
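The tail fractions being debated here can be checked directly with the standard library. Note that the famous "3.4 defects per million" figure is not the two-tailed probability outside ±6 sigma; it is a one-tailed probability at 4.5 sigma, because the Six Sigma convention allows a 1.5-sigma drift in the process mean:

```python
import math

def frac_outside(k):
    """Fraction of a normal distribution outside +/- k sigma (two-tailed)."""
    return math.erfc(k / math.sqrt(2))

print(f"outside 1 sigma: {frac_outside(1):.4f}")   # ~0.3173
print(f"outside 3 sigma: {frac_outside(3):.6f}")   # ~0.002700
print(f"outside 6 sigma: {frac_outside(6):.2e}")   # ~2 in a billion

# One-tailed at 4.5 sigma -- the Six Sigma "3.4 per million" convention:
print(f"one-tailed at 4.5 sigma: {0.5 * math.erfc(4.5 / math.sqrt(2)):.2e}")
```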

pavium
Will this be more efficient than techniques like the box plot?
Sam Rudolph
*"3.4 defective parts per million opportunities."* In that case, the article's assumes +/-6 sigma, not +/-3 sigma.
dmckee
Yes, you're right, dmckee. I went back and looked, and the 99.99966% yield corresponds to 3.4 ppm. At least the article should be a helpful reference.
pavium
+6  A: 

Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:

  1. A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
  2. The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.

There are a few good ways to proceed:

  1. Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.

  2. Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means clamping the bottom x% to the xth-percentile value and the top x% to the (100−x)th-percentile value.

  3. If you have a small dataset, you could just plot your data and examine it manually for implausible values.

  4. If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median.
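Option 4 above can be sketched as follows, assuming a flat list of numbers (the factor 1.4826 is the standard scaling that makes the MAD a consistent estimator of sigma for normal data):

```python
import statistics

def mad_filter(values, k=3.5):
    """Keep values within k median-absolute-deviations of the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    # 1.4826 * MAD estimates sigma for normally distributed data
    scale = 1.4826 * mad
    return [v for v in values if abs(v - med) <= k * scale]

data = [10.0, 10.2, 9.9, 10.1, 9.8, 50.0]  # 50.0 is a gross outlier
print(mad_filter(data))  # the 50.0 is dropped
```

Unlike a mean/standard-deviation filter, the median and MAD are barely affected by the outliers themselves, so the cutoff stays sensible even with heavy contamination.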

dsimcha
+1 for noting that outliers will screw up your standard deviation.
Kena
+1  A: 

Depending on your data and its meaning, you might want to look into RANSAC (random sample consensus). This is widely used in computer vision, and generally gives excellent results when trying to fit data with lots of outliers to a model.

And it's very simple to conceptualize and explain. On the other hand, it's non-deterministic, which may cause problems depending on the application.
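A toy sketch of the idea, for the simplest case of fitting a line y = a·x + b to 2-D points (a real implementation would use a proper library and a least-squares refit on the final inlier set):

```python
import random

def ransac_line(points, n_iters=200, threshold=1.0, seed=0):
    """Toy RANSAC: repeatedly fit a line through two random points and
    keep the model with the most inliers (points within `threshold`)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line; skipped in this sketch
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Points on y = 2x + 1, plus two gross outliers
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40.0), (7, -30.0)]
(a, b), inliers = ransac_line(pts)
print(a, b, len(inliers))  # slope ~2, intercept ~1, 10 inliers
```

The outliers simply never end up in the consensus set, instead of dragging the fit toward themselves as they would in ordinary least squares.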

Kena
+2  A: 

Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).

Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like using this formula from mtsu.edu.
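For the "calculate it by hand" route, here is a sketch using numpy for a simple linear regression (the data below is made up; Cook's D for point i is (e_i² / (p·s²)) · h_ii / (1 − h_ii)², where h_ii is the leverage from the hat matrix):

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's D for each point of a simple linear regression, by hand."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat (leverage) matrix
    h = np.diag(H)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)                # residual variance estimate
    return (resid**2 / (p * s2)) * h / (1 - h)**2

x = np.arange(10, dtype=float)
y = 3.0 * x + 2.0 + np.array([0.1, -0.2, 0.0, 0.1, -0.1,
                              0.2, 0.0, -0.1, 0.1, 12.0])
d = cooks_distance(x, y)
print("most influential point:", int(np.argmax(d)))  # the last point
```

A common rule of thumb flags points with D > 4/n for a closer look on the scatterplot.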

eric.a.booth
Hey, thanks for the great link!
Sam Rudolph
A: 

Here is how I would go about it in SQL Server

The query below gets the average weight from a fictional Scale table holding a single weigh-in for each person, without letting the unusually heavy or light throw off the more realistic average:

  select w.Gender, Avg(w.Weight) as AvgWeight
    from ScaleData w
    join ( select d.Gender, Avg(d.Weight) as AvgWeight,
                  2 * STDEVP(d.Weight) as StdDeviation
             from ScaleData d
            group by d.Gender
         ) d
      on w.Gender = d.Gender
     and w.Weight between d.AvgWeight - d.StdDeviation
                      and d.AvgWeight + d.StdDeviation
   group by w.Gender

There may be a better way to go about this, but it works and works well. If you have come across another more efficient solution, I’d love to hear about it.

NOTE: for normally distributed data, keeping only values within two standard deviations of the mean removes roughly the top and bottom 2.5% (about 5% in total) from the average. You can adjust how many outliers are removed by changing the multiplier (the 2*) in front of the standard deviation, as per: http://en.wikipedia.org/wiki/Standard_deviation

Jason Southwell