What do you call an average that does not include outliers? For example, if you have a set:
{90,89,92,91,5} avg = 73.4
but excluding the outlier (5) we have
{90,89,92,91(,5)} avg = 90.5
How do you describe this average in statistics?
It's called the trimmed mean. Basically what you do is compute the mean of the middle 80% of your data, ignoring the top and bottom 10%. Of course, these numbers can vary, but that's the general idea.
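A minimal Python sketch of the trimmed mean described above. The function name `trimmed_mean` and its `proportion` parameter are made up for illustration, and rounding the trim count down is an assumption, not the only convention:

```python
from statistics import mean

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)  # number of values dropped per end
    return mean(ordered[cut:len(ordered) - cut])

scores = [90, 89, 92, 91, 5]
print(trimmed_mean(scores, 0.2))  # drops 5 and 92, averages 89, 90, 91 -> 90
```

With only five values, trimming 10% from each end rounds down to zero values dropped, so the example trims 20% instead. Note that the largest value (92) goes as well as the outlier.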
For a very specific name, you'll need to specify the mechanism for outlier rejection. One general term is "robust".
dsimcha mentions one approach: trimming. Another is clipping: all values outside a known-good range are discarded.
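Clipping might look like the sketch below. The name `clipped_mean` and the bounds 50 and 100 are assumptions chosen for this example; in practice the known-good range has to come from what you know about the data:

```python
from statistics import mean

def clipped_mean(values, low, high):
    """Mean of the values inside [low, high]; everything else is discarded."""
    kept = [v for v in values if low <= v <= high]
    if not kept:
        raise ValueError("no values inside the accepted range")
    return mean(kept)

scores = [90, 89, 92, 91, 5]
print(clipped_mean(scores, low=50, high=100))  # 5 is discarded -> 90.5
```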
I don't know if it has a name, but you could easily come up with a number of algorithms to reject outliers:
find all numbers between the 10th and 90th percentiles (do this by sorting, then rejecting the first N/10 and last N/10 numbers) and take the mean of the remaining values
sort the values, then keep rejecting the highest and lowest values for as long as doing so changes the mean or standard deviation by more than X%
sort the values, then keep rejecting the highest and lowest values for as long as they are more than K standard deviations from the mean
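As a rough sketch of the last approach (K standard deviations), here is one possible reading in Python. The name `reject_outliers`, the default `k=1.5`, removing one extreme value per pass, and the use of the population standard deviation are all assumptions made for the example:

```python
from statistics import mean, pstdev

def reject_outliers(values, k=1.5):
    """Repeatedly drop the most extreme value while it lies more than
    k standard deviations from the mean of the values still kept."""
    kept = sorted(values)
    while len(kept) > 2:
        m = mean(kept)
        s = pstdev(kept)  # population standard deviation of what is left
        if s == 0:
            break  # all remaining values are identical
        low_dev = m - kept[0]    # distance of the smallest value from the mean
        high_dev = kept[-1] - m  # distance of the largest value from the mean
        if max(low_dev, high_dev) <= k * s:
            break  # nothing left beyond the cut-off
        if low_dev >= high_dev:
            kept.pop(0)
        else:
            kept.pop()
    return kept

scores = [90, 89, 92, 91, 5]
cleaned = reject_outliers(scores, k=1.5)
print(cleaned, mean(cleaned))  # [89, 90, 91, 92] 90.5
```

On a sample this small the choice of K matters a lot: the 5 is only about 2 standard deviations from the mean of all five values, so `k=3` would reject nothing.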
The "average" you're talking about is actually called the "mean".
It's not exactly answering your question, but a different statistic which is not affected by outliers is the median, that is, the middle number.
{90,89,92,91,5} mean: 73.4
{90,89,92,91,5} median: 90
This might be useful to you, I dunno.
There is no official name because of the various mechanisms, such as Q test, used to get rid of outliers.
Removing outliers is called trimming.
No program I have ever used has an average() with an integrated trim().
A statistically sensible approach is to use a standard deviation cut-off.
For example, remove any results more than 3 standard deviations from the mean.
Using a rule like "biggest 10%" doesn't make sense. What if there are no outliers? The 10% rule would eliminate some data anyway. Unacceptable.
Another standard test for identifying outliers is to use 1.5 times the interquartile range: anything more than that distance below the first quartile or above the third quartile is treated as an outlier. This is somewhat easier than computing the standard deviation and more general, since it doesn't make any assumptions about the underlying data being from a normal distribution.
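A small Python sketch of the 1.5 x IQR rule. The name `iqr_filter` is invented for the example, and the quartile method is an assumption: `statistics.quantiles` with `method="inclusive"` is used here, and other quartile conventions can give different fences on small samples:

```python
from statistics import mean, quantiles

def iqr_filter(values, factor=1.5):
    """Keep values within `factor` * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # interquartile range
    low = q1 - factor * spread
    high = q3 + factor * spread
    return [v for v in values if low <= v <= high]

scores = [90, 89, 92, 91, 5]
kept = iqr_filter(scores)
print(kept, mean(kept))  # [90, 89, 92, 91] 90.5
```

For this data the quartiles are 89 and 91, so the fences sit at 86 and 94 and only the 5 falls outside them.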
The most common way of getting a robust average ("robust" being the usual word for resistant to bad data) is to use the median. This is just the middle value in the sorted list (or half way between the middle two values when the count is even), so for your example it would be 90, the middle of 5, 89, 90, 91, 92. For the four values left after dropping the 5 it would be 90.5, half way between 90 and 91.
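The median rule (middle value, or half way between the middle two) can be sketched in a few lines of Python. The helper name `middle_value` is made up here; the standard library's `statistics.median` does the same job:

```python
from statistics import median

def middle_value(values):
    """Median: the middle of the sorted values, or the midpoint of the
    two middle values when the count is even."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values given")
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(middle_value([90, 89, 92, 91, 5]))  # 90
print(middle_value([90, 89, 92, 91]))     # 90.5
print(median([90, 89, 92, 91, 5]))        # 90, same result from the stdlib
```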
If you want to get really into robust statistics (such as robust estimates of standard deviation etc.), I would recommend a lot of the code from the AGORAS group, but this may be too advanced for your purposes.