views:

91

answers:

2

Hi all,

I need some help with a SQL query for SQL Server 2005. Here is an example of the data in the table to be queried:

Id    PersonId      PayrollNum    ContractId     PayrollFrom     PayrollTo
---------------------------------------------------------------------------
 1     432642         85110892     1             01/05/2009      31/05/2009
 2     432642         85110892     1             01/06/2009      30/06/2009
 3     432642         85110892     1             01/07/2009      31/07/2009
 4     432642         85110892     2             01/05/2009      31/05/2009
 5     432642         85110892     2             01/06/2009      30/06/2009
 6     432642         85110892     2             01/07/2009      31/07/2009
 7     432642         85110892     2             01/08/2009      31/08/2009
 8     432642         35110892     1             01/06/2009      30/06/2009
 9     432642         35110892     1             01/05/2009      31/05/2009
10     432642         35110892     1             01/07/2009      31/07/2009

(I hope that gets formatted ok - it's hard doing tables with this thing!)

The output I need is as follows:

Id
--
 1
 4
 9

An explanation: What I basically need are the unique Id values for the records that differ by PersonId, PayrollNum and ContractId and also be the lowest PayrollFrom date for each grouping. I'm not sure how to better explain it? Hopefully you can look at the output to help understand the query's requirements.

Let me know if I need to explain this better (or if you think you understand it and can explain it better please do so).

Thanks for your help, James.

+9  A: 
WITH CTE AS (
SELECT
    /* uncomment if needed PersonId, PayrollNum, ContractId, PayrollFrom, */
    ID,
    ROW_NUMBER() OVER (
        PARTITION BY PersonId, PayrollNum, ContractId
        ORDER BY PayrollFrom ASC -- thank you to Peter Lang DESC
     ) AS Ranking
FROM
    MyTable
)
SELECT
    ID
FROM
    CTE
WHERE
    Ranking = 1

Edit, after comment

Try something like ISNULL(PersonId, -ID) instead of PersonId to force it unique to differentiate each NULL. I use -ID to it shouldn't clash where ID = a valid PersonID

gbn
Shouldn't you sort ASCending to get lowest PayrollFrom?
Peter Lang
I saw your comment, but what is your thoughts on the performance of the CTE approach when the table gets bigger?
astander
@Peter: yes. oops. fixed. thanks.
gbn
@astander: without an index on (PersonId, PayrollNum, ContractId, PayrollFrom) both solution will suffer. To answer, I tested CTEs after I asked Quassnoi the same thing. They scale well. IIRC, less "Scan count", more IO.
gbn
@Astander: It's worth noting that the execution plan and the SQL query are different things. No matter what, the data needs to be ordered to find out what record will be "Ranking = 1". Provided there are indexes to avoid having to sort the data, that "Ranking = 1" condition can be used by the optimiser to disregard all the other data. In short, with correct indexes this should scale very well; thanks to the optimiser actually being very intelligent.
Dems
This seems true, i wrote a quick test script, and they seem fairly close, with the CTE leading. Good catch here, looks very neat too.
astander
I just tested on a 20k row table. identical IO + results, different plan, 50% each of batch. Different keys/indexes of course.
gbn
@astander: An intersting exercise I once did was to compare the execution plans of different queries which are functionally the same. Some differed, some were identical. You may find that the execution plan for the ROW_NUMBER() approach is similar to the SubQuery approach. [Also try SELECT * FROM table WHERE id = (SELECT TOP 1 id FROM table AS [sub] WHERE sub.PersonId = table.PersonId AND sub.PayrollNum = table.PayrollNum AND sub.ContractId = table.ContractId ORDER BY sub.PayrollFrom ASC, sub.id ASC)]
Dems
Thanks guys. This seems to work from me and I'll look into setting up the required indexes. Genius!
j.strugnell
Not sure if any of you guys are still paying attention but I have another issue that has come to light. If either of the PersonId, PayrollNum and ContractId columns are NULL then that needs to be treated as different to another NULL in the same column. For example if the PersonId was null in row 2 and 3 then I need the output to be 1,2,3,4,9. The current query gives 1,2,4,9 because it treats the 2 nulls as equal. Any ideas?
j.strugnell
Thanks gbn, nice trick!
j.strugnell
+3  A: 

You will need a query that looks something like this

SELECT  *
FROM    myTable t INNER JOIN
     (
      SELECT PersonId, 
        PayrollNum,
        ContractId ,
        MIN(PayrollFrom) MIN_PayrollFrom
      FROM MyTable
      GROUP BY PersonId, 
        PayrollNum,
        ContractId
     ) minDates ON t.PersonId = minDates.PersonId
        AND  t.PayrollNum = minDates.PayrollNum
        AND  t.ContractId = minDates.ContractId
        AND  t.PayrollFrom = minDates.MIN_PayrollFrom
astander
CTEs and ranking functions make things so much nicer these days :-)
gbn