What do you call an average that does not include outliers? For example, if you have a set:

{90,89,92,91,5} avg = 73.4

but excluding the outlier (5) we have

{90,89,92,91(,5)} avg = 90.5

How do you describe this average in statistics?

+9  A: 

It's called the trimmed mean. Basically what you do is compute the mean of the middle 80% of your data, ignoring the top and bottom 10%. Of course, these numbers can vary, but that's the general idea.
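As a rough illustration of the idea, here is a minimal Python sketch. The function name `trimmed_mean` and its `proportion` parameter are made up for this example, and rounding the trim count down is just one possible convention.

```python
from statistics import mean

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each end (hypothetical helper)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values dropped per side, rounded down
    return mean(ordered[k:len(ordered) - k])

data = [90, 89, 92, 91, 5]
print(trimmed_mean(data, 0.2))  # drops the lowest (5) and highest (92) value
print(trimmed_mean(data, 0.1))  # k == 0 for five values, so this is the plain mean
```

With only five values, a 10% trim removes nothing, so the cut has to be at least 20% per side before the 5 is dropped.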

dsimcha
Using a rule like "biggest 10%" doesn't make sense. What if there are no outliers? The 10% rule would eliminate some data anyway. Unacceptable.
Jason Cohen
See my answer for a statistically-significant way to decide which data qualify as an "outlier."
Jason Cohen
Well, there's no rigorous definition of outlier. As for your response, if there are outliers they will affect your estimate of the standard deviation. Furthermore, standard deviation can be a bad measure of dispersion for non-normally distributed data.
dsimcha
True there's no rigorous definition, but eliminating based on percentile is certainly wrong in many common cases, including the example given in the question.
Jason Cohen
Also, outliers will not affect standard deviation much. Unless there are many of them, in which case they aren't outliers! You might for example have a bi-modal or linearly random distribution, but then throwing out data is wrong, and indeed the notion of "average" might be wrong.
Jason Cohen
+6  A: 

For a very specific name, you'll need to specify the mechanism for outlier rejection. One general term is "robust".

dsimcha mentions one approach: trimming. Another is clipping: all values outside a known-good range are discarded.

Mr Fooz
+1  A: 

I don't know if it has a name, but you could easily come up with a number of algorithms to reject outliers:

  1. find all numbers between the 10th and 90th percentiles (do this by sorting then rejecting the first N/10 and last N/10 numbers) and take the mean value of the remaining values

  2. sort the values, then keep rejecting the highest and lowest values as long as doing so changes the mean or standard deviation by more than X%

  3. sort the values, then keep rejecting the highest and lowest values as long as they are more than K standard deviations from the mean
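The third option above could be sketched roughly like this in Python. The name `sigma_clip`, the default `k=3.0`, and the choice to drop one value per pass and then recompute are assumptions for the example, not a standard definition.

```python
from statistics import mean, stdev

def sigma_clip(values, k=3.0):
    """Repeatedly drop the value farthest from the mean while it lies
    more than k sample standard deviations away (hypothetical helper)."""
    kept = list(values)
    while len(kept) > 2:
        m, s = mean(kept), stdev(kept)
        if s == 0:
            break  # all remaining values are identical
        worst = max(kept, key=lambda v: abs(v - m))
        if abs(worst - m) <= k * s:
            break  # nothing left beyond the cut-off
        kept.remove(worst)
    return kept

data = [9, 10, 11] * 7 + [100, 1000]
kept = sigma_clip(data)
print(len(kept), mean(kept))
```

Recomputing after each removal matters here: the 1000 inflates the standard deviation so much that the 100 only stands out once the 1000 is gone.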

Jason S
+8  A: 

The "average" you're talking about is actually called the "mean".

It's not exactly answering your question, but a different statistic which is not affected by outliers is the median, that is, the middle number.

{90,89,92,91,5} mean: 73.4
{90,89,92,91,5} median: 90

This might be useful to you, I dunno.
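For what it's worth, Python's standard `statistics` module computes both directly; a small sketch using the numbers from the question:

```python
from statistics import mean, median

data = [90, 89, 92, 91, 5]
print(mean(data))    # 73.4
print(median(data))  # 90 (middle of 5, 89, 90, 91, 92)

# With an even count, the median is the average of the two middle values.
print(median([90, 89, 92, 91]))  # 90.5
```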

nickf
You are all missing the point. It has nothing to do with the mean, median, mode, stdev, etc. Consider this: you have {1,1,2,3,2,400} avg = 68.17, but what we want is: {1,1,2,3,2,400} avg = 1.8 // minus the [400] value. What do you call that?
Tawani
@Tawani - they are not all missing the point. What you say needs to be defined in general terms. You cannot go by a single example. Without general definitions, if the 400 were 30, would it still be an outlier? And if it were 14? Or 9? Where do you stop? You need standard deviations, ranges, or quartiles to decide that.
Daniel Daranas
+5  A: 

There is no official name because of the various mechanisms, such as Q test, used to get rid of outliers.

Removing outliers is called trimming.

No program I have ever used has average() with an integrated trim()

mvrak
+11  A: 

A statistically sensible approach is to use a standard deviation cut-off.

For example, remove any results more than 3 standard deviations away from the mean.

Using a rule like "biggest 10%" doesn't make sense. What if there are no outliers? The 10% rule would eliminate some data anyway. Unacceptable.
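A minimal single-pass sketch of such a cut-off in Python. The function name `reject_beyond_k_sigma` is hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption.

```python
from statistics import mean, stdev

def reject_beyond_k_sigma(values, k=3.0):
    """Keep only values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

# Twenty values of +/-1 plus one huge outlier.
data = [1.0, -1.0] * 10 + [10000.0]
kept = reject_beyond_k_sigma(data)
print(len(kept), mean(kept))

# Caveat: with only five values nothing can exceed 3 standard deviations.
print(reject_beyond_k_sigma([90, 89, 92, 91, 5]))
```

One caveat with small samples: a single value can be at most (n - 1) / sqrt(n) sample standard deviations from the mean, which is about 1.79 for n = 5, so a 3-standard-deviation rule cannot remove the 5 from the question's own example.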

Jason Cohen
I was going to say this approach doesn't work (pathological case = 1000 numbers between -1 and +1, and then a single outlier of value +10000) because an outlier can bias the mean so that none of the results are within 3 stddev of the mean, but it looks like mathematically it *does* work.
Jason S
It's not at all hard to prove that there has to be at least one data point within one standard deviation (inclusive) of the mean. Any outlier big enough to pull the mean way out is going to enlarge the standard deviation a lot.
David Thornley
http://en.wikipedia.org/wiki/Chebychev%27s_inequality This applies regardless of the distribution.
dsimcha
ooh! thanks dsimcha! Chebyshev is one of my math heroes (mostly for function approximations).
Jason S
The problem is that "outlier" isn't a post-hoc conclusion about a particular realized data set. It's hard to know what people mean by outlier without knowing what the purpose of their proposed mean statistic is.
Gregg Lind
So your categorical statement of "unacceptable" is nonsense, and not really very helpful. The trimmed mean has some useful properties, and some less useful, like any statistic.
Gregg Lind
@Gregg: I agree with you. Your statement is more accurate than mine. However I still contend that generally it's more useful to depend on dispersion rather than percentile.
Jason Cohen
+6  A: 

Another standard test for identifying outliers is to use 1.5 times the interquartile range (IQR): flag any value more than 1.5 × IQR below the first quartile or above the third quartile. This is somewhat easier than computing the standard deviation and more general, since it doesn't assume that the underlying data come from a normal distribution.
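A rough Python sketch of this rule, using `statistics.quantiles` (Python 3.8+). The name `iqr_filter` is made up, and the `"inclusive"` quartile method is an assumption; quartile conventions differ, and on very small samples the choice can change which values get flagged.

```python
from statistics import mean, quantiles

def iqr_filter(values, factor=1.5):
    """Keep values within factor * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if low <= v <= high]

kept = iqr_filter([90, 89, 92, 91, 5])
print(kept, mean(kept))  # the 5 falls below the lower fence

kept = iqr_filter([1, 1, 2, 3, 2, 400])
print(kept, mean(kept))  # the 400 falls above the upper fence
```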

Mark Lavin
A: 

The most common way of getting a robust average (robust being the usual word for resistant to bad data) is to use the median. This is just the middle value in the sorted list (or halfway between the middle two values when the count is even), so for your example it would be 90. With the outlier removed, it would be 90.5, halfway between 90 and 91.

If you want to get really into robust statistics (such as robust estimates of standard deviation, etc.), I would recommend a lot of the code at The AGORAS group, but this may be too advanced for your purposes.

Nick Fortescue