views:

310

answers:

3

I have the following SQL table -

Date       StoreNo       Sales
23/4            34     4323.00
23/4            23      564.00
24/4            34     2345.00
etc

I am running a query that returns average sales, max sales and min sales for a certain period -

select avg(Sales), max(sales), min(sales)
from tbl_sales
where date between etc

But there are some values coming through in the min and max that are really extreme - perhaps because the data entry was bad, perhaps because some anomoly had occurred on that date and store.

What I'd like is a query that returns average, max and min, but somehow excludes the extreme values. I am open to how this is done, but perhaps it would use standard deviations in some way (for example, only using data within x std devs of the true average).

Many thanks

+2  A: 

In order to calculate the standard deviation, you need to iterate through all of the elements, so it would be impossible to do this in one query. The lazy way would be to just do it in two passes:

DECLARE
    @Avg int,
    @StDev int

SELECT @Avg = AVG(Sales), @StDev = STDEV(Sales)
FROM tbl_sales
WHERE ...

SELECT AVG(Sales) AS AvgSales, MAX(Sales) AS MaxSales, MIN(Sales) AS MinSales
FROM tbl_sales
WHERE ...
AND Sales >= @Avg - @StDev * 3
AND Sales <= @Avg + @StDev * 3

Another simple option that might work (fairly common in analysis of scientific data) would be to just drop the minimum and maximum x values, which works if you have a lot of data to process. You can use ROW_NUMBER to do this in one statement:

WITH OrderedValues AS
(
    SELECT
        Sales,
        ROW_NUMBER() OVER (ORDER BY Sales) AS RowNumAsc,
        ROW_NUMBER() OVER (ORDER BY Sales DESC) AS RowNumDesc
)
SELECT ...
FROM tbl_sales
WHERE ...
AND Sales >
(
    SELECT MAX(Sales)
    FROM OrderedValues
    WHERE RowNumAsc <= @ElementsToDiscard
)
AND Sales <
(
    SELECT MIN(Sales)
    FROM OrderedValues
    WHERE RowNumDesc <= @ElementsToDiscard
)

Replace ROW_NUMBER with RANK or DENSE_RANK if you want to discard a certain number of unique values.

Beyond these simple tricks you start to get into some pretty heavy stats. I have to deal with similar kinds of validation and it's far too much material for a SO post. There are a hundred different algorithms that you can tweak in a dozen different ways. I would try to keep it simple if possible!

Aaronaught
+1  A: 

Maybe what you're looking for are percentiles.

Standard deviation tends to be sensitive to outliers, since it's calculated using the square of the difference between a value and the mean.

Maybe a more robust, less sensitive measure like absolute value of difference between a value and the mean would be more appropriate in your case.

duffymo
+1  A: 

Expanding on DuffyMo's post you could do something like

With SalesStats As
    (
    Select Sales, NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
    From tbl_Sales
    )
Select Avg( Sales ), Max( Sales ), Min( Sales )
From SalesStats
Where NtileNum Between 5 And 95

This will exclude the lowest 5% and highest 95%. If you have numbers that vary wildly, you may find that the Average isn't a quality summary statistic and should consider using median. You can do that by doing something like:

With SalesStats As
    (
    Select NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
        , ROW_NUMBER() OVER ( Order By Id ) As RowNum
    From tbl_Sales
    )
    , TotalSalesRows
        (
        Select COUNT(*) As Total
        From tbl_Sales
        )
    , Median As
        (
        Select Sales 
        From SalesStats
            Cross Join TotalSalesRows
        Where RowNum In ( (TotalRows.Total + 1) / 2, (TotalRows.Total + 2) / 2 )
        )
Select Avg( Sales ), Max( Sales ), Min( Sales ), Median.Sales
From SalesStats
    Cross Join Median
Where NtileNum Between 5 And 95
Thomas
Correct about using median vs. average for divergent data; note that the median query can be done with a single scan by putting `COUNT(*)` and `ROW_NUMBER` inside the same CTE, and should probably select `AVG(Sales)` unless you really want two summary rows when the count is even.
Aaronaught
For the first block of code, the following line must include Sales in order to take AVG(Sales): Select Sales, NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
KE
@KE - Oversight on my part. Have corrected. Thx.
Thomas