I am trying to calculate an average without being thrown off by a small set of far-off numbers. For example, in 1,2,1,2,3,4,50 the single 50 will throw off the entire average.

If I have a list of numbers like so:

19,20,21,21,22,30,60,60

The average is about 31.6

The median is 30

The mode is 21 & 60 (averaged to 40.5)

But anyone can see that the majority is in the range 19-22 (5 in, 3 out), and if you take the average of just that range it's 20.6 (very different from any of the numbers above).

I am thinking that you can get this like so:

c+d-r

Where c is the count of numbers, d is the number of distinct values, and r is the range. Then you can apply this to all the possible ranges, and the highest score marks the optimal range to get an average from.

For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6

If you applied this to the entire number list it would be 8+6-41=-27

I think this works pretty well, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:

19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
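
For reference, the brute-force loop described above could look roughly like the sketch below. It is only an illustration of the c+d-r scoring, not a faster method. The function name best_range_by_score and the shape of the returned array are made up for this example.

```php
<?php
// Hypothetical helper (not from the original post): scores every range
// [lo, hi] built from the distinct values with the c + d - r formula.
function best_range_by_score($values) {
    sort($values);
    $distinct = array_values(array_unique($values));
    $best = null; // stays null if $values is empty
    foreach ($distinct as $i => $lo) {
        for ($j = $i; $j < count($distinct); $j++) {
            $hi = $distinct[$j];
            // Values that fall inside the candidate range.
            $inside = array_filter($values, function ($v) use ($lo, $hi) {
                return $v >= $lo && $v <= $hi;
            });
            $c = count($inside);   // count of numbers in the range
            $d = $j - $i + 1;      // distinct values in the range
            $r = $hi - $lo;        // width of the range
            $score = $c + $d - $r;
            if ($best === null || $score > $best['score']) {
                $best = array(
                    'lo' => $lo,
                    'hi' => $hi,
                    'score' => $score,
                    'mean' => array_sum($inside) / $c,
                );
            }
        }
    }
    return $best;
}

$result = best_range_by_score(array(19, 20, 21, 21, 22, 30, 60, 60));
printf("%d-%d scores %d, mean %.1f\n", $result['lo'], $result['hi'], $result['score'], $result['mean']);
// prints "19-22 scores 6, mean 20.6"
```

With k distinct values this tests k(k+1)/2 ranges (21 for k = 6) and rescans the whole list for each one, which is where the cost comes from.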

I am wondering if there is a more efficient way to get an average like this.

Or does someone have a better algorithm altogether?

+2  A: 

You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
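
A minimal sketch of that idea, assuming the population standard deviation and a cutoff of $k standard deviations (1 by default). The function name mean_without_sd_outliers is made up for this example.

```php
<?php
// Hypothetical helper: mean of the values that lie within $k population
// standard deviations of the overall mean. Assumes a non-empty array.
function mean_without_sd_outliers($values, $k = 1.0) {
    $n = count($values);
    $mean = array_sum($values) / $n;
    $sum_sq = 0.0;
    foreach ($values as $v) {
        $sum_sq += ($v - $mean) * ($v - $mean);
    }
    $sd = sqrt($sum_sq / $n);
    // Keep only the values close enough to the mean.
    $kept = array();
    foreach ($values as $v) {
        if (abs($v - $mean) <= $k * $sd) {
            $kept[] = $v;
        }
    }
    // With $k below 1 nothing may survive; fall back to the plain mean.
    return count($kept) > 0 ? array_sum($kept) / count($kept) : $mean;
}

printf("%.2f\n", mean_without_sd_outliers(array(19, 20, 21, 21, 22, 30, 60, 60)));
// prints "22.17"
```

On the list from the question this drops both 60s but keeps the 30, so the result is 133/6 rather than 20.6.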

grossvogel
This only really works if you have a normal distribution. With a flat distribution it may well do something bad to the result.
Rafe
A: 

You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.

Robusto
A: 

Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number summary (minimum, lower quartile, median, upper quartile, maximum) often used to figure these things out.

function get_median($arr) {
    sort($arr);
    $c = count($arr) - 1;
    if ($c%2) {
        $b = round($c/2);
        $a = $b-1;
        return ($arr[$b] + $arr[$a]) / 2 ;
    } else {
        return $arr[($c/2)];
    }
}

function get_five_number_summary($arr) {
    // Assumes at least two values.
    sort($arr);
    $n = count($arr);
    // Each half holds floor(n/2) values, so the median itself is
    // left out of both halves when n is odd.
    $half = (int) floor($n / 2);
    $lower_half = array_slice($arr, 0, $half);
    $upper_half = array_slice($arr, $n - $half);
    // min, lower quartile, median, upper quartile, max
    return array($arr[0], get_median($lower_half), get_median($arr), get_median($upper_half), $arr[$n - 1]);
}

function find_outliers($arr) {
    $fns = get_five_number_summary($arr);
    $interquartile_range = $fns[3] - $fns[1];
    // Conventional fences: 1.5 x IQR below the lower quartile and above the upper quartile.
    $low = $fns[1] - 1.5 * $interquartile_range;
    $high = $fns[3] + 1.5 * $interquartile_range;
    foreach ($arr as $v) {
        if ($v > $high || $v < $low) {
            echo "$v is an outlier<br>";
        }
    }
}

//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);

Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!

To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm

This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P
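
In case it helps, here is one way that last step could go: a self-contained sketch that drops everything outside the quartile fences and averages the rest. It assumes the conventional 1.5 x IQR multiplier, and the names sorted_median and mean_without_iqr_outliers are made up for this example.

```php
<?php
// Hypothetical helper: median of an already sorted, non-empty array.
function sorted_median($sorted) {
    $n = count($sorted);
    $mid = (int) floor($n / 2);
    return ($n % 2) ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

// Hypothetical helper: mean of the values inside the IQR fences.
// Assumes at least two values; $k = 1.5 is the conventional multiplier.
function mean_without_iqr_outliers($values, $k = 1.5) {
    sort($values);
    $n = count($values);
    $half = (int) floor($n / 2);
    // Quartiles as medians of the lower and upper halves.
    $q1 = sorted_median(array_slice($values, 0, $half));
    $q3 = sorted_median(array_slice($values, $n - $half));
    $iqr = $q3 - $q1;
    $kept = array();
    foreach ($values as $v) {
        if ($v >= $q1 - $k * $iqr && $v <= $q3 + $k * $iqr) {
            $kept[] = $v;
        }
    }
    return array_sum($kept) / count($kept);
}

printf("%.1f\n", mean_without_iqr_outliers(array(1, 230, 239, 331, 340, 800)));
// prints "285.0"
```

For 1,230,239,331,340,800 the quartiles are 230 and 340, so the fences sit at 65 and 505 and both 1 and 800 are dropped before averaging.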

David Titarenco
A: 

You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
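
A rough sketch of that trimmed mean, assuming the same number of values is dropped from each end. The name trimmed_mean and the $keep parameter are made up for this example.

```php
<?php
// Hypothetical helper: mean of the middle $keep fraction of the sorted
// values. Assumes a non-empty array and 0 < $keep <= 1.
function trimmed_mean($values, $keep = 0.9) {
    sort($values);
    $n = count($values);
    // Number of values dropped from each end.
    $drop = (int) floor($n * (1 - $keep) / 2);
    $middle = array_slice($values, $drop, $n - 2 * $drop);
    return array_sum($middle) / count($middle);
}

printf("%.2f\n", trimmed_mean(array(19, 20, 21, 21, 22, 30, 60, 60), 0.5));
// prints "23.50"
```

On a list this short the middle 90% drops nothing at all, so the example keeps the middle 50% instead (21, 21, 22, 30).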

There is no one true answer to your question, because there are always going to be distributions that will give you a funny answer (e.g., consider a biased bi-modal distribution). This is why many statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.

Rafe
A: 

Why don't you use the median? It's not 30, it's 21.5.

Mike C