views:

66

answers:

5

I have a table schema similar to the following (simplified):

CREATE TABLE Transactions
(
    TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    CustomerID int NOT NULL,  -- Foreign key, not shown
    TransactionDate datetime NOT NULL,
    ...
)

CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)

To give a bit of background here, this transaction table is actually consolidating several different types of transactions from another vendor's database (we'll call it an ETL process), and I therefore don't have a great deal of control over the order in which they get inserted. Even if I did, transactions may be backdated, so the important thing to note here is that the maximum TransactionID for any given customer is not necessarily the most recent transaction.

In fact, the most recent transaction is a combination of the date and the ID. Dates are not unique - the vendor often truncates the time of day - so to get the most recent transaction, I have to first find the most recent date, and then find the most recent ID for that date.

I know that I can do this with a windowing query (ROW_NUMBER() OVER (PARTITION BY TransactionDate DESC, TransactionID DESC)), but this requires a full index scan and a very expensive sort, and thus fails miserably in terms of efficiency. It's also pretty awkward to keep writing all the time.

Slightly more efficient is using two CTEs or nested subqueries, one to find the MAX(TransactionDate) per CustomerID, and another to find the MAX(TransactionID). Again, it works, but requires a second aggregate and join, which is slightly better than the ROW_NUMBER() query but still rather painful performance-wise.

I've also considered using a CLR User-Defined Aggregate and will fall back on that if necessary, but I'd prefer to find a pure SQL solution if possible to simplify the deployment (there's no need for SQL-CLR anywhere else in this project).

So the question, specifically is:

Is it possible to write a query that will return the newest TransactionID per CustomerID, defined as the maximum TransactionID for the most recent TransactionDate, and achieve a plan equivalent in performance to an ordinary MAX/GROUP BY query?

(In other words, the only significant steps in the plan should be an index scan and stream aggregate. Multiple scans, sorts, joins, etc. are likely to be too slow.)

+1  A: 

The most useful index might be:

CustomerID, TransactionDate desc, TransactionId desc

Then you could try a query like this:

select  a.CustomerID
,       b.TransactionID
from    (
        select  distinct
                CustomerID
        from    YourTable
        ) a
cross apply   
        (
        select  top 1
                TransactionID
        from    YourTable
        where   CustomerID = a.CustomerID
        order by
                TransactionDate desc,
                TransactionId desc
        ) b
Andomar
Should be `CROSS APPLY` otherwise it won't parse. This is something I hadn't thought of; I just tested it and it seems to be on about the same order as the `ROW_NUMBER` solution. :( No sort, but an `Index Seek` appears in addition to the full scan (which I'd expect to be much faster but it turns out to be 3 times as expensive as the scan). Still, +1 for coming up with something I hadn't considered.
Aaronaught
@Aaronaught: Edited, and perhaps a DESC index is slightly faster. Note that this solution makes little sense without the index.
Andomar
Actually, the index doesn't make a huge difference vs. the one I have - there are relatively few (but more than one) transactions per `CustomerID, TransactionDate` pair. A `DESC` index does improve things (it also improves the `ROW_NUMBER()`) query; still, I'd prefer not to basically duplicate an existing index and I'm pretty convinced that there's a way to do this in a single scan.
Aaronaught
+1  A: 

Disclaimer: Thinking out loud :)

Could you have an indexed, computed column that combines the TransactionDate and TransactionID columns into a form that means finding the latest transaction is just a case of finding the MAX of that single field?

AdaTheDev
Even though this was a little light on implementation details, the *concept* of combining the fields was in fact what was needed to come up with an optimized solution. And since I hate self-accepting, I'll give you the check. ;)
Aaronaught
+1  A: 

How about something like this where you force the optimizer to calculate the derived table first. In my tests, this was less expensive than the two Max comparisons.

Select T.CustomerId, T.TransactionDate, Max(TransactionId)
From Transactions As T
    Join    (
            Select T1.CustomerID, Max(T1.TransactionDate) As MaxDate
            From Transactions As T1
            Group By T1.CustomerId
            ) As Z
        On Z.CustomerId = T.CustomerId
            And Z.MaxDate = T.TransactionDate
Group By T.CustomerId, T.TransactionDate
Thomas
A: 

This one seemed to have good performance statistics:

SELECT
    T1.customer_id,
    MAX(T1.transaction_id) AS transaction_id
FROM
    dbo.Transactions T1
INNER JOIN
(
    SELECT
        T2.customer_id,
        MAX(T2.transaction_date) AS max_dt
    FROM
        dbo.Transactions T2
    GROUP BY
        T2.customer_id
) SQ1 ON
    SQ1.customer_id = T1.customer_id AND
    T1.transaction_date = SQ1.max_dt
GROUP BY
    T1.customer_id
Tom H.
A: 

I think I actually figured it out. @Ada had the right idea and I had the same idea myself, but was stuck on how to form a single composite ID and avoid the extra join.

Since both dates and (positive) integers are byte-ordered, they can not only be concatenated into a BLOB for aggregation but also separated after the aggregate is done.

This feels a little unholy, but it seems to do the trick:

SELECT
    CustomerID,
    CAST(SUBSTRING(MAX(
        CAST(TransactionDate AS binary(8)) + 
        CAST(TransactionID AS binary(4))),
      9, 4) AS int) AS TransactionID
FROM Transactions
GROUP BY CustomerID

That gives me a single index scan and stream aggregate. No need for any additional indexes either, it performs the same as just doing MAX(TransactionID) - which makes sense, obviously, since all of the concatenation is happening inside the aggregate itself.

Aaronaught
Do you still get the same execution plan / performance if you encapsulate the unholiness 'out of sight' in a computed column?
AakashM
@AakashM: Yes, that's actually what I've done. Indexing the computed column doesn't really improve performance, just makes it slightly easier to write queries (I still have to "unwrap" the values).
Aaronaught