I need a T-SQL ranking approach similar to the one provided by NTILE(), except that the members of each tile would be on a sliding distribution so that higher ranking tiles have fewer members.
For example
CREATE TABLE #Rank_Table(
id int identity(1,1) not null,
hits bigint not null default 0,
PERCENTILE smallint null
)
--Slant the distribution of the data
INSERT INTO #Rank_Table (hits)
select CASE
when DATA > 9500 THEN DATA*30
WHEN data > 8000 THEN DATA*5
WHEN data < 7000 THEN DATA/3 +1
ELSE DATA
END
FROM
(select top 10000 (ABS(CHECKSUM(NewId())) % 99 +1) * (ABS(CHECKSUM(NewId())) % 99 +1 ) DATA
from master..spt_values t1
cross JOIN master..spt_values t2) exponential
Declare @hitsPerGroup as bigint
Declare @numGroups as smallint
set @numGroups=100
select @hitsPerGroup=SUM(hits)/(@numGroups -1) FROM #Rank_Table
select @hitsPerGroup HITS_PER_GROUP
--This is an even distribution
SELECT id,HITS, NTILE(@numGroups) Over (Order By HITS DESC) PERCENTILE
FROM #Rank_Table
GROUP by id, HITS
--This is my best attempt, but it skips groups because of the erratic distribution
select
T1.ID,
T1.hits,
T.RunningTotal/@hitsPerGroup + 1 TILE,
T.RunningTotal
FROM #Rank_Table T1
CROSS APPLY ( Select SUM(hits) RunningTotal FROM #Rank_Table where hits <= T1.hits) T
order by T1.hits
DROP TABLE #Rank_Table
In #Rank_table, NTILE(@numGroups) creates an even distribution of @numGroups groups. What I need are @numGroups groups where the Tile 100 has the fewest members, tile 2 would have one or more than tile 1, tile 3 would have 1 or more than tile 2 ... tile 100 would have the most.
I'm using SQL Server 2008. In practice this will be run against a permanent table with potentially millions of rows in order to periodically update the PERCENTILe column with its percentile from 1-100.
My best attempt above will skip percentiles and performs poorly. There must be a better way. My statistics are fuzzy, but it seems that some approach to make the hits more linear would do the trick?