Hi Everyone,

I'm trying to set up some data to calculate multiple medians in SQL Server 2008, but I'm having a performance problem. Right now, I'm using this pattern (another example is at the bottom). Yes, I'm not using a CTE, but using one won't fix the problem I'm having anyway: the performance is poor because the row_number sub-queries run in serial, not in parallel.

Here's a full example; below the SQL, I explain the problem in more detail.

-- build the example table    

CREATE TABLE #TestMedian (
    StateID INT,
    TimeDimID INT,
    ConstructionStatusID INT,

    PopulationSize BIGINT,
    SquareMiles BIGINT
);

INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 100000, 200000),
       (1, 1, 1, 200000, 300000),
       (1, 1, 1, 300000, 400000),
       (1, 1, 1, 100000, 200000),
       (1, 1, 1, 250000, 300000),
       (1, 1, 1, 350000, 400000);

--TRUNCATE TABLE #TestMedian

    SELECT
     StateID
     ,TimeDimID
     ,ConstructionStatusID
     ,NumberOfRows = COUNT(*) OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID)
     ,PopulationSizeRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY PopulationSize)
     ,SquareMilesRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY SquareMiles)
     ,PopulationSize
     ,SquareMiles
    INTO #MedianData
    FROM #TestMedian

    SELECT MinRowNum = MIN(PopulationSizeRowNum), MaxRowNum = MAX(PopulationSizeRowNum), StateID, TimeDimID, ConstructionStatusID, MedianPopulationSize = AVG(PopulationSize)
    FROM #MedianData T
    WHERE PopulationSizeRowNum IN((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
    GROUP BY StateID, TimeDimID, ConstructionStatusID

    SELECT MinRowNum = MIN(SquareMilesRowNum), MaxRowNum = MAX(SquareMilesRowNum), StateID, TimeDimID, ConstructionStatusID, MedianSquareMiles = AVG(SquareMiles)
    FROM #MedianData T
    WHERE SquareMilesRowNum IN((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
    GROUP BY StateID, TimeDimID, ConstructionStatusID


    DROP TABLE #MedianData
    DROP TABLE #TestMedian
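A quick note on the row-selection arithmetic above, since it relies on integer division: with NumberOfRows = 6, (6 + 1) / 2 = 3 and (6 + 2) / 2 = 4, so rows 3 and 4 get averaged; with an odd count like 5, both expressions evaluate to 3 and the single middle row comes back. One caveat: AVG over a BIGINT column also does integer arithmetic, so an even-count median gets truncated; multiply by 1.0 inside the AVG if the fractional part matters.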

The problem with this query is that SQL Server executes both of the "ROW_NUMBER() OVER..." sub-queries in serial, not in parallel. So if I have 10 of these ROW_NUMBER calculations, it calculates them one after the other and I get linear growth, which stinks. I have an 8-way, 32GB system I'm running this query on, and I would love some parallelism. I'm trying to run this type of query on a 5,000,000-row table.

I can tell it's doing this by looking at the query plan and seeing the Sorts in the same execution path (displaying the query plan's XML wouldn't work really well on SO).
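For anyone who wants to reproduce this, one standard way to capture the actual plan as XML (a plain SQL Server session option, nothing specific to my setup):

    SET STATISTICS XML ON;
    -- run the SELECT ... INTO and the two median SELECTs from above here
    SET STATISTICS XML OFF;

In the resulting plan, serial execution shows up as the Sort operators chained in one path, with no Parallelism (Gather Streams) operators around them.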

So my question is this: How can I alter this query so that the ROW_NUMBER queries are executed in parallel? Is there a completely different technique I can use to prepare the data for multiple median calculations?

+2  A: 

Each ROW_NUMBER requires the rows to be sorted first. Since your two RNs have different ORDER BY conditions, the query must produce the result, order it for the first RN (it may already be ordered appropriately), produce the first RN, then order it for the second RN and produce the second RN result. There simply isn't any magic pixie dust that can materialize a row number value without counting where the row falls in the required order.

Remus Rusanu
I understand that there's no magic pixie dust available; there's a world-wide shortage. :) I know that it can't figure out what the RN is without first ordering the data. How can I set it up so it orders the data in different ways in parallel to calculate the RNs? Is there a technique to break it into multiple queries and then join the result sets? I'm not married to using the RN style, so any constructive idea would be appreciated. I can't be the first person in the world who wants to take a set of data and calculate multiple medians at once efficiently! To do that, the data must be sorted in different ways.
JayRu
It's really hard with row_numbers over 8 different orders, and with the partition by requirements. Even with subqueries that *may* be parallelized, it's unlikely they will be. Parallel options are available to partition the execution of a single operation, like a table scan, not to split multiple different subqueries. I would revisit the requirements and reconsider the need for all the row_numbers...
Remus Rusanu
Unfortunately, calculating a median requires that the data be sorted in order. The ROW_NUMBER simply tells you where each row lands in the sort for a given field. Thx for the help so far...
JayRu
+1  A: 

I am not sure that it can parallelize this, because it needs to do non-partitioned (with respect to population vs. square miles) scans. They'll conflict with each other on disk, so it has to get everything into memory at least once first, and then it might be eligible for parallelizing, if it's big enough.

In any event, the following performs significantly (40%) faster for me:

;WITH cte AS (
    SELECT
        StateID
        ,TimeDimID
        ,ConstructionStatusID
        ,NumberOfRows = COUNT(*) OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID)
        ,PopulationSizeRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY PopulationSize)
        ,SquareMilesRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY SquareMiles)
        ,PopulationSize
        ,SquareMiles
    FROM TestMedian
)
, ctePop AS (
    SELECT MinPopNum = MIN(PopulationSizeRowNum)
    , MaxPopNum = MAX(PopulationSizeRowNum)
    , StateID, TimeDimID, ConstructionStatusID
    , MedianPopulationSize= AVG(PopulationSize) 
    FROM cte T
    WHERE PopulationSizeRowNum IN((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
    GROUP BY StateID, TimeDimID, ConstructionStatusID
)
, cteSqM AS (
    SELECT MinSqMNum = MIN(SquareMilesRowNum)
    , MaxSqMNum = MAX(SquareMilesRowNum)
    , StateID, TimeDimID, ConstructionStatusID
    , MedianSquareMiles= AVG(SquareMiles) 
    FROM cte T
    WHERE SquareMilesRowNum IN((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
    GROUP BY StateID, TimeDimID, ConstructionStatusID
)
SELECT s.StateID, s.TimeDimID, s.ConstructionStatusID
, MinPopNum, MaxPopNum, MedianPopulationSize
, MinSqMNum, MaxSqMNum, MedianSquareMiles
FROM ctePop p
JOIN cteSqM s ON s.StateID = p.StateID
    AND s.TimeDimID = p.TimeDimID
    AND s.ConstructionStatusID = p.ConstructionStatusID

Also, the sorts themselves should get parallelized once they get big enough. You'll need at least 100,000 test rows before that might happen, though.
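One general note, not specific to this query: OPTION (MAXDOP n) only caps the degree of parallelism for a statement, it never forces a parallel plan, and the optimizer still weighs the estimated cost against the server-wide "cost threshold for parallelism" setting. A minimal sketch (the same OPTION clause can be appended after the final JOIN above):

    -- caps, but does not guarantee, parallel execution for this one statement
    SELECT COUNT(*)
    FROM TestMedian
    OPTION (MAXDOP 8);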


OK, yep, I get parallelism after I load it up enough with this statement:

INSERT INTO TestMedian
-- cross join syscolumns against 10 rows of spt_values to multiply the row count tenfold
SELECT ABS(id) % 3, ABS(id) % 2, ABS(id) % 5, ABS(id), colid * 10000
FROM master.sys.syscolumns, (SELECT TOP 10 * FROM master.dbo.spt_values) a
RBarryYoung
Thx. I'm testing this approach on my actual data set now to see if the ROW_NUMBER calculations are parallelized. On a small subset it looked promising.
JayRu
A: 

Some lateral thinking: If you need this data often and/or quickly, and the underlying data set doesn't change frequently (for reasonably high values of "frequently"), could you precalculate any of these values and store them in some form of pre-aggregated table?

(Yep, this is demonormalization, but if you need performance over all else, it's worth considering.)
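A minimal sketch of what that could look like; the table name and the refresh mechanics are hypothetical, not from the question:

-- Hypothetical summary table, rebuilt by a scheduled job whenever the
-- underlying data changes; readers then get each median with a seek
-- instead of re-sorting millions of rows.
CREATE TABLE MedianSummary (
    StateID INT,
    TimeDimID INT,
    ConstructionStatusID INT,
    MedianPopulationSize BIGINT,
    MedianSquareMiles BIGINT,
    PRIMARY KEY (StateID, TimeDimID, ConstructionStatusID)
);

The refresh job would just TRUNCATE this table and repopulate it with the joined median query from the answer above.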

Philip Kelley
I meant to say "denormalization" there. Honest.
Philip Kelley
I believe you :). Unfortunately, I don't see a pre-aggregation step here, though. In this example, the population sizes are spread across a set of dimensions, and for each set of dimensions I need to find the median value of the population size. The only pre-aggregation I can think of is to replace the individual dimension columns with a single identifier so the partitioning, grouping, and joining is done on fewer columns (that might be really worth it).
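For what it's worth, a sketch of that last idea (all names hypothetical): map each distinct dimension combination to a surrogate key so the partitioning, grouping, and joining run on one INT column instead of three.

-- Hypothetical mapping table: one surrogate key per dimension combination.
CREATE TABLE DimGroup (
    GroupID INT IDENTITY(1, 1) PRIMARY KEY,
    StateID INT,
    TimeDimID INT,
    ConstructionStatusID INT,
    UNIQUE (StateID, TimeDimID, ConstructionStatusID)
);

-- Fact rows would then carry GroupID, so the window functions become, e.g.:
--   ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY PopulationSize)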
JayRu