views:

4900

answers:

8

According to MSDN, Median is not available as an aggregate function in Transact-Sql. However, I would like to find out whether it is possible to create this functionality (using the Create Aggregate function, user defined function, or some other method).

What would be the best way (if possible) to do this - allow for the calculation of a median value (assuming a numeric data type) in an aggregate query?

+1  A: 

In a UDF , write

Select Top 1 medianSortColumn from Table T
  Where (Select Count(*) from Table 
         Where MedianSortColumn < 
      (Select Count(*) From Table) / 2)
  Order By medianSortColumn
Charles Bretana
In case of an even number of items, the median is the average of the two middle items, which is not covered by this UDF.
Yaakov Ellis
+1  A: 

See other solutions for median calculation in SQL here: "Simple way to calculate median with MySQL" (the solutions are mostly vendor-independent).

Bill Karwin
Marked as CW so I don't get points for merely linking. :)
Bill Karwin
A: 

There is some good info here on statistics using Transact-SQL.

They suggest the following approach:

SELECT x.Hours median
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
HAVING 
   SUM(CASE WHEN y.Hours <= x.Hours 
      THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND
   SUM(CASE WHEN y.Hours >= x.Hours 
      THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1

It also provides a solution for the financial median: in the case when there is an even number of records, so no "true" median, it will return the average of the inner two records.

RedFilter
A: 

I wanted to work out a solution by myself, but my brain tripped and fell on the way. I think it works, but don't ask me to explain it in the morning. :P

DECLARE @table AS TABLE
(
    Number int not null
);

insert into @table select 2;
insert into @table select 4;
insert into @table select 9;
insert into @table select 15;
insert into @table select 22;
insert into @table select 26;
insert into @table select 37;
insert into @table select 49;

DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;

WITH MyResults(RowNo, Number) AS
(
    SELECT RowNo, Number FROM
     (SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo
)
SELECT AVG(Number) FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
Gavin Schultz-Ohkubo
A: 
--Create Temp Table to Store Results in
DECLARE @results AS TABLE 
(
    [Month] datetime not null
 ,[Median] int not null
);

--This variable will determine the date
DECLARE @IntDate as int 
set @IntDate = -13


WHILE (@IntDate < 0) 
BEGIN

--Create Temp Table
DECLARE @table AS TABLE 
(
    [Rank] int not null
 ,[Days Open] int not null
);

--Insert records into Temp Table
insert into @table 

SELECT 
    rank() OVER (ORDER BY DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0), DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')),[SVR].[ref_num]) as [Rank]
 ,DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')) as [Days Open]
FROM
 mdbrpt.dbo.View_Request SVR
 LEFT OUTER JOIN dbo.dtv_apps_systems vapp 
 on SVR.category = vapp.persid
 LEFT OUTER JOIN dbo.prob_ctg pctg 
 on SVR.category = pctg.persid
 Left Outer Join [mdbrpt].[dbo].[rootcause] as [Root Cause] 
 on [SVR].[rootcause]=[Root Cause].[id]
 Left Outer Join [mdbrpt].[dbo].[cr_stat] as [Status]
 on [SVR].[status]=[Status].[code]
 LEFT OUTER JOIN [mdbrpt].[dbo].[net_res] as [net] 
 on [net].[id]=SVR.[affected_rc]
WHERE
 SVR.Type IN ('P') 
 AND
 SVR.close_date IS NOT NULL 
 AND
 [Status].[SYM] = 'Closed'
 AND
 SVR.parent is null
 AND
 [Root Cause].[sym] in ( 'RC - Application','RC - Hardware', 'RC - Operational', 'RC - Unknown')
 AND
 (
  [vapp].[appl_name] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
 OR
  pctg.sym in ('Systems.Release Health Dashboard.Problem','DTV QA Test.Enterprise Release.Deferred Defect Log')
 AND  
  [Net].[nr_desc] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
 )
 AND
 DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0) = DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0)
ORDER BY [Days Open]



DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;

WITH MyResults(RowNo, [Days Open]) AS
(
    SELECT RowNo, [Days Open] FROM
        (SELECT ROW_NUMBER() OVER (ORDER BY [Days Open]) AS RowNo, [Days Open] FROM @table) AS Foo
)


insert into @results
SELECT 
 DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0) as [Month]
 ,AVG([Days Open])as [Median] FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2) 


set @IntDate = @IntDate+1
DELETE FROM @table
END

select *
from @results
order by [Month]
Gregg Silverman
+11  A: 

There are lots of ways to do this, with dramtically varying performance. Here's one particularly well-optimized solution, from http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx. This is a particuarly optimal soluton when it comes to actual I/Os generated during execution-- it looks more costly than other solutions but is actually much faster.

That page also contains a discussion of other solutions and perf testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.

As with all database performance scenarios, always try to test a solution out with real data on real hardware-- you never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally-speedy solution slower.

SELECT
   CustomerId,
   AVG(TotalDue)
FROM
(
   SELECT
      CustomerId,
      TotalDue,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId 
         ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId 
         ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
   FROM Sales.SalesOrderHeader SOH
) x
WHERE 
   RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
Justin Grant
I don't think this works if you have dupes, particularly a lot of dupes, in your data. You can't guarantee the row_numbers will line up. You can get some really crazy answers for your median, or even worse, no median at all.
Jonathan Beerhalter
That's why having a disambiguator (SalesOrderId in the code example above) is important, so you can ensure that the order of result-set rows is consistent both backwards and forwards. Often a unique primary key makes an ideal disambiguator because it's available without a separate index lookup. If there's no disambiguation column available (for example, if the table has no uniquifying key), then another approach must be used to calculate median, because as you correctly point out, if you can't guarantee that DESC row numbers are mirror images of ASC row numbers, then results are unpredictable.
Justin Grant
Thanks, when switching the columns to my DB, I dropped the disambiguator, thinking it wasn't relevant. In that case, this solution works really really well.
Jonathan Beerhalter
+3  A: 

If you're using SQL 2005 or better this is a nice, simple-ish median calculation for a single column in a table:

SELECT
(
 (SELECT MAX(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf)
 +
 (SELECT MIN(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf)
) / 2 AS Median
Jeff Atwood
+1  A: 
select  max(my_column) as [my_column], quartile
from    (select my_column, ntile(4) over (order by my_column) as [quartile]
         from   my_table) i
--where quartile = 2
group by quartile

This will give you the median and interquartile range in one fell swoop. If you really only want one row that is the median then uncomment the where clause.

When you stick that into an explain plan, 60% of the work is sorting the data which is unavoidable when calculating position dependent statistics like this.

Sir Wobin
This actually works pretty well, and allows for partitioning of the data.
Jonathan Beerhalter