Hi Everyone,
I'm trying to set up some data to calculate multiple medians in SQL Server 2008, but I'm having a performance problem. Right now, I'm using this pattern (another example, bottom). Yes, I'm not using a CTE, but using one wouldn't fix the problem I'm having anyway. The performance is poor because the ROW_NUMBER calculations run in serial, not in parallel.
Here's a full example. Below the SQL I explain the problem more.
-- build the example table
CREATE TABLE #TestMedian (
StateID INT,
TimeDimID INT,
ConstructionStatusID INT,
PopulationSize BIGINT,
SquareMiles BIGINT
);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 100000, 200000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 200000, 300000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 300000, 400000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 100000, 200000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 250000, 300000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 350000, 400000);
-- TRUNCATE TABLE #TestMedian
-- number the rows of each group once per measure, and keep the group's row count
SELECT
    StateID
    ,TimeDimID
    ,ConstructionStatusID
    ,NumberOfRows = COUNT(*) OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID)
    ,PopulationSizeRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY PopulationSize)
    ,SquareMilesRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY SquareMiles)
    ,PopulationSize
    ,SquareMiles
INTO #MedianData
FROM #TestMedian;
-- median PopulationSize per group: average the one or two middle rows
-- (AVG over a BIGINT column returns a truncated BIGINT)
SELECT MinRowNum = MIN(PopulationSizeRowNum), MaxRowNum = MAX(PopulationSizeRowNum), StateID, TimeDimID, ConstructionStatusID, MedianPopulationSize = AVG(PopulationSize)
FROM #MedianData T
WHERE PopulationSizeRowNum IN ((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
GROUP BY StateID, TimeDimID, ConstructionStatusID;
-- median SquareMiles per group, same technique
SELECT MinRowNum = MIN(SquareMilesRowNum), MaxRowNum = MAX(SquareMilesRowNum), StateID, TimeDimID, ConstructionStatusID, MedianSquareMiles = AVG(SquareMiles)
FROM #MedianData T
WHERE SquareMilesRowNum IN ((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
GROUP BY StateID, TimeDimID, ConstructionStatusID;
-- clean up
DROP TABLE #MedianData;
DROP TABLE #TestMedian;
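In case the WHERE filter looks odd: it relies on integer division, so (NumberOfRows + 1) / 2 and (NumberOfRows + 2) / 2 are the same row for an odd count and the two middle rows for an even count. Here is a minimal standalone check of that arithmetic. The derived table N and the LowerMiddleRow/UpperMiddleRow aliases are made up just for this illustration and are not part of the real query.

```sql
-- Hypothetical row counts, only to show which row numbers the filter keeps.
SELECT
    NumberOfRows
    ,LowerMiddleRow = (NumberOfRows + 1) / 2  -- integer division on INT operands
    ,UpperMiddleRow = (NumberOfRows + 2) / 2
FROM (VALUES (5), (6)) AS N (NumberOfRows)
ORDER BY NumberOfRows;
-- NumberOfRows = 5 -> rows 3 and 3 (a single middle row)
-- NumberOfRows = 6 -> rows 3 and 4 (AVG of the two middle rows)
```

For the six sample rows above, that means rows 3 and 4 of each ordering get averaged.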
The problem with this query is that SQL Server executes both of the "ROW_NUMBER() OVER..." calculations in serial, not in parallel. So if I have 10 of these ROW_NUMBER calculations, it'll calculate them one after the other and I get linear growth, which stinks. I'm running this query on an 8-way, 32 GB system and I would love some parallelism. I'm trying to run this type of query on a 5,000,000-row table.
I can tell it's doing this by looking at the query plan and seeing the Sorts in the same execution path (displaying the query plan's XML wouldn't work very well on SO).
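For anyone who wants to look at the same kind of plan, this is roughly how it can be captured. It's only a sketch: #PlanDemo and its columns are a made-up stand-in for the real table, and on a table this small the optimizer may well pick a different plan than it does on 5,000,000 rows.

```sql
-- Hypothetical stand-in table, only so the snippet runs on its own.
CREATE TABLE #PlanDemo (GroupID INT, A BIGINT, B BIGINT);
INSERT INTO #PlanDemo (GroupID, A, B) VALUES (1, 10, 30);
INSERT INTO #PlanDemo (GroupID, A, B) VALUES (1, 20, 20);
INSERT INTO #PlanDemo (GroupID, A, B) VALUES (1, 30, 10);

-- Return the actual execution plan as XML alongside the result set.
SET STATISTICS XML ON;

SELECT
    GroupID
    ,ARowNum = ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY A)
    ,BRowNum = ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY B)
FROM #PlanDemo;

SET STATISTICS XML OFF;

DROP TABLE #PlanDemo;
```

In the returned plan, the thing to look for is whether the Sort operators for the two different ORDER BY columns sit one after the other on the same path.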
So my question is this: How can I alter this query so that the ROW_NUMBER queries are executed in parallel? Is there a completely different technique I can use to prepare the data for multiple median calculations?