views:

252

answers:

6

I've got an implementation for this that uses a super hairy recursive CTE that is really hard to follow/maintain. I was hoping that one of the brains on SO could come up with some more straightforward tSQL approach code to accomplish the following:

Table Documents

DocID    SortOrder    PageCount    StartPgNum   EndPgNum
5        1            2               {1}          {2}
8        2            7               {3}          {9}
22       3            3               {10}         {12}

For the table given above, I need a query to populate StartPgNum and EndPgNum (Sample values included in the example in {} to make the intentions clearer for what I need.

Assumptions:
* DocID, SortOrder, and PageCount are pre-populated.
* StartPgNum and EndgNum need to be populated by tSQL code.
* SortOrder always starts at 1, and is continuous with no gaps.
* Documents should get a continuous page numbering scheme as ordered by SortOrder

+1  A: 

The fastest way to do it would be with a Quirky Update. It depends whether you fall into the 'Microsoft don't explicitly say it works so I'll avoid' it camp or not...

Otherwise you're in hairy recursive CTE (as you've already discovered) or triangular join (which could become a nightmare on a large data set) territory.

Matt Whitfield
If it helps, the data set this is going to run against will typically be smallish (less than 50K rows) and almost never over 500K rows.
JohnFx
+4  A: 

Updated to be better :)

DECLARE @temp TABLE (DocID INT, SortOrder INT, PageCount INT)

INSERT INTO @temp VALUES (5, 1, 2)
INSERT INTO @temp VALUES (8, 2, 7)
INSERT INTO @temp VALUES (22, 3, 3)

SELECT
    *,
    StartPgNum + PageCount-1 AS EndPgNum
FROM
(SELECT
    DocID,
    SortOrder,
    PageCount,
    ISNULL((SELECT SUM(PageCount)+1 FROM @temp WHERE SortOrder < parent.SortOrder), 1) AS StartPgNum
FROM
    @temp parent) _temp
Timothy Khouri
Subselect is exactly what i would have done. +1
Ian Boyd
Nice answer!I was so hung up on it being a running total that I didn't think about summing all the page counts from previous rows. BTW I tweaked it to fix the endPgNum calculation. It was adding an extra page on each row.
JohnFx
I saw the edit... thanks :)
Timothy Khouri
I like your single query solution.
awright18
If you're interested in why this will perform poorly over a large number of rows : http://www.sqlservercentral.com/articles/T-SQL/61539/
Matt Whitfield
The short answer is that any query like this is exponential... if there are 10 rows, it'll run 10 + 9 + 8, etc. times. If there are 1,000,000 rows, forget it! Unfortunately there is no way around this unless you calculated and stored the values (in a trigger, or whatever) upon insert/update (which is what I would do).
Timothy Khouri
@Timothy Khouri, you could also use a procedure and have a cursor and then do the addition yourself. That way SQL doesn't have to go back and add everything all over again and each row is touched only once. Also, no need for extra triggers or tables to hold said data to use.
Joshua
Very true - that is a good suggestion (that also has it's drawbacks). For instance, the stored procedure would have to be run, and then the result set used in a query... whereas if it was all a view/query as in the example above, the surrounding JOIN or WHERE clauses may limit the need to hit all records in the table. +1, good comment!
Timothy Khouri
+1  A: 

Maybe one of these three solutions can help, since this is a kind of "running total" problem: http://www.sqlteam.com/article/calculating-running-totals

AaronLS
+1  A: 

SQL 2008 using cross apply (running total)

/*
DocID    SortOrder    PageCount    StartPgNum   EndPgNum
5        1            2               {1}          {2}
8        2            7               {3}          {9}
22       3            3               {10}        {12}
*/

Declare @MyTable TABLE(
DocID int,
SortOrder int,
PageCount int
)

Insert into @MyTable(DocID,SortOrder,PageCount)
values (5,1,2), (8,2,7), (22,3,3)

select 
    T1.docID, 
    T1.Sortorder, 
    T1.Pagecount, 
    (T.RunningTotal - T1.Pagecount) + 1 StartPgNum , 
    T.RunningTotal EndPgNum

FROM    @MyTable T1
        CROSS APPLY ( Select SUM(PageCount) RunningTotal FROM @MyTable where SortOrder <= T1.SortOrder) T
order by T1.sortorder
Ben Dempsey
Gotta love the CROSS APPLY.
JohnFx
This would be the triangular join rbar approach...
Matt Whitfield
A: 

I chose to tackle these two problems by creating functions, one to get the first page the second to get the last page. Here are the functions and the query which will work.

CREATE FUNCTION dbo.GetFirstPage(@SortOrder int) 
RETURNS int
as
BEGIN
DECLARE @FirstPage int
SET @FirstPage = 1
IF(@SortOrder > 1)
BEGIN
SELECT @FirstPage = SUM(PageCount) + 1
FROM Documents
WHERE SortOrder < @SortOrder
END
RETURN @FirstPage
END

CREATE FUNCTION dbo.GetLastPage(@FirstPage int, @PageCount int)
RETURNS int 
AS
BEGIN
RETURN (@FirstPage + @PageCount -1)
END

And finally the query.

SELECT * ,  
        dbo.GetFirstPage(SortOrder) AS FirstPage,
        dbo.GetLastPage(dbo.GetFirstPage(SortOrder),Pagecount) AS LastPage 
FROM Documents
awright18
+3  A: 

I did some testing on all of the solutions provided here in the other answers, my original "Hairy Recursive CTE" option and for the sake of completeness a simple cursor based approach. To my great surprise the cursor option performed the best by a clear margin in all my tests (1K Rows, 10KRows, 50K Rows, 500K Rows)

Here are the average times for each approach for 10K records:
Hairy Recursive CTE: 3 minutes 55 seconds
CROSS APPLY (Ben Dempsey): 21-25 seconds
SUBSELECTS (Tim Khouri): 19-21 seconds
CURSOR: 1-2 Seconds

Here is my cursor based solution:

Declare @temp TABLE(
 DocID INT PRIMARY KEY NOT NULL, 
 SortOrder INT NOT NULL, 
 PageCount INT NOT NULL,
 BegPg int,
 EndPg int
)

Insert into @temp (DocID,SortOrder,PageCount) 
SELECT top 50000 docid, ROW_NUMBER() OVER (ORDER BY DOCID),Pages FROM tblDocuments

DECLARE @PC int
SET @PC=1
DECLARE @FetchPageCount int
DECLARE @FetchDocID int

DECLARE myCursor CURSOR FOR 
SELECT DocID, PageCount FROM @temp ORDER BY SortOrder

OPEN myCursor
FETCH NEXT FROM myCursor INTO @FetchDocID,@FetchPageCount

WHILE @@FETCH_STATUS = 0
BEGIN

  UPDATE @temp SET BegPg=@PC, EndPg=@PC+ @FetchPageCount-1   
  WHERE (Docid=@fetchDocid)

     SET @PC = @PC + @FetchPageCount

    FETCH NEXT FROM myCursor INTO @FetchDocID,@FetchPageCount
END 
CLOSE myCursor
DEALLOCATE myCursor

SELECT * FROM @temp

Who would have guessed it? Maybe cursors aren't always evil.

A word of warning: Lest you be tempted to replace the update to the "WHERE CURRENT OF myCursor" syntax, it performed much slower than using the current version with a where clause, although still faster than most of the other approaches.

JohnFx
I would be interested to see the results from an actual table instead of the Table variable which is in memory.Make sure you put an index on SORTORDER (optionally include PAGECOUNT in the index).Oops NM, just noticed you did get it from an actual table, but check the index.If im bored maybe ill try it at home :)
Ben Dempsey
Running totals is one of the few cases where cursors can improve performance over set-based methods (note I said can, not will) , they are worth trying when you need a running total. I'm glad you tried all aproaches as that is the only what to know what will work best with your particular db design and server specs.
HLGEM