tags:

views:

49

answers:

5

I have a large data-set of emails sent and status-codes.

ID Recipient           Date       Status
 1 [email protected] 01/01/2010      1
 2 [email protected] 02/01/2010      1
 3 [email protected]    01/01/2010      1
 4 [email protected]    02/01/2010      2
 5 [email protected]    03/01/2010      1
 6 [email protected]  01/01/2010      1
 7 [email protected]  02/01/2010      2

In this example:

  • all emails sent to someone have a status of 1
  • the middle email (by date) sent to them has a status of 2, but the latest is 1
  • the last email sent to others has a status of 2

What I need to retrieve is a count of all emails sent to each person, and what the latest status code was.

The first part is fairly simple:

SELECT Recipient, Count(*) EmailCount
FROM Messages
GROUP BY Recipient
ORDER BY Recipient

Which gives me:

Recipient           EmailCount
[email protected] 2
[email protected]    3
[email protected]  2

How can I get the most recent status code too?

The end result should be:

Recipient           EmailCount LastStatus
[email protected]          2          1
[email protected]             3          1
[email protected]           2          2

Thanks.

(Server is Microsoft SQL Server 2008, query is being run through an OleDbConnection in .Net)

A: 

You can use the ranking functions for this. Something like (not tested):

WITH MyResults AS
(
   SELECT Recipient, Status, ROW_NUMBER() OVER( Recipient ORDER BY (  [date] DESC ) ) AS   [row_number]
   FROM Messages
)
SELECT MyResults.Recipient, MyCounts.EmailCount, MyResults.Status
FROM (
    SELECT Recipient, Count(*) EmailCount
    FROM Messages
    GROUP BY Recipient
) MyCounts
INNER JOIN MyResults
ON MyCounts.Recipient = MyResults.Recipient
WHERE MyResults.[row_number] = 1
cleek
+2  A: 

It's not very pretty, but I'd probably just use a couple of subselects:

SELECT Recipient,
    COUNT(*) EmailCount,
    (SELECT Status
     FROM Messages M2
     WHERE Recipient = M.Recipient
         AND Date = (SELECT MAX(Date)
                     FROM Messages
                     WHERE Recipient = M2.Recipient))
FROM Messages M
GROUP BY Recipient
ORDER BY Recipient
Chad Birch
+4  A: 

This is an example of a 'max per group' query. I think it is easiest to understand by splitting it up into two subqueries and then joining the results.

The first subquery is what you already have.

The second subquery uses the windowing function ROW_NUMBER to number the emails for each recipient starting with 1 for the most recent, then 2, 3, etc...

The results from the first query are then joined with the result from the second query that has row number 1, i.e. the most recent. Doing it this way guarantees that you will only get one row for each recipient in the case that there are ties.

Here is the query:

SELECT T1.Recipient, T1.EmailCount, T2.Status FROM
(
    SELECT Recipient, COUNT(*) AS EmailCount
    FROM Messages
    GROUP BY Recipient
) T1
JOIN
(
    SELECT
        Recipient,
        Status,
        ROW_NUMBER() OVER (PARTITION BY Recipient ORDER BY Date Desc) AS rn
    FROM Messages
) T2
ON T1.Recipient = T2.Recipient AND T2.rn = 1

This gives the following results:

Recipient            EmailCount  Status  
[email protected]   2           2       
[email protected]  2           1       
[email protected]     3           1       
Mark Byers
Excellent! Thank you very much.
Cylindric
+1  A: 
SELECT
    M.Recipient,
    C.EmailCount,
    M.Status
FROM
    (
    SELECT Recipient, Count(*) EmailCount
    FROM Messages
    GROUP BY Recipient
    ) C
    JOIN
    (
    SELECT Recipient, MAX(Date) AS LastDate
    FROM Messages
    GROUP BY Recipient
    ) MD ON C.Recipient = MD.Recipient
    JOIN
    Messages M ON MD.Recipient = M.Recipient AND MD.LastDate = M.Date
ORDER BY
    Recipient

I've found aggregates mostly scale better then ranking functions

gbn
+1 My experience too. In order of decreasing readability but increasing performance: ranking function -> aggregates -> cross apply with CTE.
Andomar
A: 

You cannot easily do this is a single query because count(*) is a group function whereas the latest status comes from a specific row. Here is the query to get the latest status for each user:

SELECT M.Recipient, M.Status FROM Messages M
WHERE M.Date = (SELECT MAX(SUB.Date) FROM MESSAGES SUB
    WHERE SUB.Recipient = M.Recipient)
FelixM