views:

3444

answers:

4

What's the best way to calculate percentile rankings (e.g. the 90th percentile or the median score) in MSSQL 2005?

I'd like to be able to select the 25th, median, and 75th percentiles for a single column of scores (preferably in a single record so I can combine with average, max, and min). So for example, table output of the results might be:

Group|MinScore|MaxScore|AvgScore|pct25|median|pct75
T1|52|96|74|68|76|84
T2|48|98|74|68|75|85

A: 

i'd probably use a the sql server 2005 row_number() over (order by score ) / (select count(*) from scores) or something along those lines.

A: 

i'd do something like:

select @n = count(*) from tbl1
select @median = @n / 2
select @p75 = @n * 3 / 4
select @p90 = @n * 9 / 10

select top 1 score from (select top @median score from tbl1 order by score asc) order by score desc

is this right?

+4  A: 

I would think that this would be the simplest solution:

SELECT TOP N PERCENT FROM TheTable ORDER BY TheScore DESC

Where N = (100 - desired percentile). So if you wanted all rows in the 90th percentile, you'd select the top 10%.

I'm not sure what you mean by "preferably in a single record". Do you mean calculate which percentile a given score for a single record would fall into? e.g. do you want to be able to make statements like "your score is 83, which puts you in the 91st percentile." ?

EDIT: OK, I thought some more about your question and came up with this interpretation. Are you asking how to calculate the cutoff score for a particular percentile? e.g. something like this: to be in the 90th percentile you must have a score greater than 78.

If so, this query works. I dislike sub-queries though, so depending on what it was for, I'd probably try to find a more elegant solution. It does, however, return a single record with a single score.

-- Find the minimum score for all scores in the 90th percentile
SELECT Min(subq.TheScore) FROM
(SELECT TOP 10 PERCENT TheScore FROM TheTable
ORDER BY TheScore DESC) AS subq
Matt
A: 

I've been working on this a little more, and here's what I've come up with so far:

CREATE PROCEDURE [dbo].[TestGetPercentile]

@percentile as float,
@resultval as float output

AS

BEGIN

WITH scores(score, prev_rank, curr_rank, next_rank) AS (
    SELECT dblScore,
     (ROW_NUMBER() OVER ( ORDER BY dblScore ) - 1.0) / ((SELECT COUNT(*) FROM TestScores) + 1)  [prev_rank],
     (ROW_NUMBER() OVER ( ORDER BY dblScore ) + 0.0) / ((SELECT COUNT(*) FROM TestScores) + 1)  [curr_rank],
     (ROW_NUMBER() OVER ( ORDER BY dblScore ) + 1.0) / ((SELECT COUNT(*) FROM TestScores) + 1)  [next_rank]
    FROM TestScores
)

SELECT @resultval = (
    SELECT TOP 1 
    CASE WHEN t1.score = t2.score
     THEN t1.score
    ELSE
     t1.score + (t2.score - t1.score) * ((@percentile - t1.curr_rank) / (t2.curr_rank - t1.curr_rank))
    END
    FROM scores t1, scores t2
    WHERE (t1.curr_rank = @percentile OR (t1.curr_rank < @percentile AND t1.next_rank > @percentile))
     AND (t2.curr_rank = @percentile OR (t2.curr_rank > @percentile AND t2.prev_rank < @percentile))
)

END

Then in another stored procedure I do this:

DECLARE @pct25 float;
DECLARE @pct50 float;
DECLARE @pct75 float;

exec SurveyGetPercentile .25, @pct25 output
exec SurveyGetPercentile .50, @pct50 output
exec SurveyGetPercentile .75, @pct75 output

Select
    min(dblScore) as minScore,
    max(dblScore) as maxScore,
    avg(dblScore) as avgScore,
    @pct25 as percentile25,
    @pct50 as percentile50,
    @pct75 as percentile75
From TestScores

It still doesn't do quite what I'm looking for. This will get the stats for all tests; whereas I would like to be able to select from a TestScores table that has multiple different tests in it and get back the same stats for each different test (like I have in my example table in my question).

Soldarnal