Say I have a PostgreSQL table with the following values:

id | value
----------
1  | 4
2  | 8
3  | 100
4  | 5
5  | 7

If I use PostgreSQL to calculate the average, it gives me 24.8, because the high value of 100 has a great impact on the calculation. What I would actually like is an average somewhere around 6, with the extreme(s) eliminated.

I am looking for a way to eliminate extremes and want to do this in a "statistically correct" way. The threshold for an extreme cannot be fixed in advance. I cannot say: if a value is over X, it has to be eliminated.

I have been racking my brain over the PostgreSQL aggregate functions but cannot work out which one is right for me to use. Any suggestions?

+2  A: 

PostgreSQL can also calculate the standard deviation.

You could take only the data points that lie within avg() +/- 2*stddev(), which for normally distributed data would roughly correspond to the 95% of data points closest to the average.

Of course the 2 can also be 3 (about 99.7%) or even 6 (virtually everything), but do not get hung up on the numbers, because in the presence of a collection of outliers you are no longer dealing with a normal distribution.

Be very careful and validate that it works as expected.
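As a rough sketch of what that could look like, and of why the validation matters: the query below assumes the table from the question is called mytable (the question does not name it) and uses a multiplier of 2. It computes the mean and sample standard deviation once, then averages only the rows inside the band.

```sql
-- Assumption: the table is named mytable and has a numeric column called value.
-- The CTE computes the overall mean and sample standard deviation once.
WITH stats AS (
    SELECT avg(value)         AS mean,
           stddev_samp(value) AS sd
    FROM   mytable
)
-- Average only the rows that fall within mean +/- 2 * sd.
SELECT avg(t.value) AS avg_within_2_sd
FROM   mytable AS t
CROSS JOIN stats AS s
WHERE  t.value BETWEEN s.mean - 2 * s.sd
                   AND s.mean + 2 * s.sd;
```

With only the five rows from the question, the standard deviation comes out at about 42, so the band runs from roughly -59 to 109 and the value 100 stays in: the result is still 24.8. In a sample this small a single outlier inflates the standard deviation enough to hide itself, so check the cutoff against real data before relying on it.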

Peter Tillemans
This sounds good! I didn't know stddev would result in percentages of the set although it sounds perfectly legit. I know if I combine your answer with the one by Rodger, I must be on the right track!
milovanderlinden
+1  A: 

I cannot say: if a value is over X, it has to be eliminated.

Well, you could use HAVING and a subselect to eliminate outliers, something like:

HAVING value < (
 SELECT 2 * avg(value)
 FROM   mytable
 GROUP BY ...
)

(Or, for that matter, use a more complex version to eliminate anything above 2 or 3 standard deviations if you want something that will be better at eliminating only outliers.)
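A complete query along those lines might look like the sketch below. It assumes the table is called mytable with a numeric column value, and it swaps HAVING for WHERE, since the filter applies to individual rows before they are aggregated. The factor of 2 is just the arbitrary cutoff from the fragment above.

```sql
-- Assumption: the table is named mytable and has a numeric column called value.
SELECT avg(value) AS avg_without_outliers
FROM   mytable
WHERE  value < (
    SELECT 2 * avg(value)   -- cutoff: twice the overall average
    FROM   mytable
);
```

For the five rows in the question the cutoff is 2 * 24.8 = 49.6, so 100 is dropped and the average of 4, 8, 5 and 7 is 6.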

The other option is to look at generating a median value, which is a fairly statistically sound way of accounting for outliers; happily there are three reasonable examples of just that: one from the PostgreSQL Wiki, one built as an Oracle compatibility layer, and another from the PostgreSQL Journal. Note the caveats around how precisely/accurately they implement medians.
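If you are on PostgreSQL 9.4 or later, a custom aggregate may not be needed at all, because the built-in ordered-set aggregate percentile_cont can compute a median directly. A minimal sketch, again assuming a table called mytable with a numeric column value:

```sql
-- Assumption: the table is named mytable and has a numeric column called value.
-- Requires PostgreSQL 9.4 or later (ordered-set aggregates).
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS median_value
FROM   mytable;
```

For the five rows in the question the sorted values are 4, 5, 7, 8, 100, so this returns 7. With an even number of rows percentile_cont interpolates between the two middle values; percentile_disc(0.5) would instead return an actual value from the table.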

Rodger
Excellent answer, especially the wiki page on the median aggregate! I will, however, as Peter Tillemans suggests, combine it with the stddev. But since your answer contains the most hints, I will mark it as the correct answer.
milovanderlinden
A: 

Easy:

SELECT avg(value) FROM foo WHERE value < 100;

Always read SQL queries starting with the FROM clause, then move on to the WHERE clause; the aggregates are calculated over the result set that those two produce.

Janning