ansaurus

Question

How can I efficiently compute the MAX of one column, ordered by another column?

Answer 1

+1 A:

The most useful index might be:

CustomerID, TransactionDate desc, TransactionId desc
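
Spelled out as DDL, that might look roughly like the following. This is only a sketch: the index name is made up, and `YourTable` is the same placeholder table name used in the query in this answer.

```sql
-- Hypothetical index name; YourTable is a placeholder for the real table.
-- The DESC keys let "top 1 ... order by TransactionDate desc, TransactionId desc"
-- read the first matching row for each CustomerID straight off the index.
CREATE NONCLUSTERED INDEX IX_YourTable_CustomerID_Latest
    ON YourTable (CustomerID ASC, TransactionDate DESC, TransactionId DESC);
```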

Then you could try a query like this:

select  a.CustomerID
,       b.TransactionID
from    (
        select  distinct
                CustomerID
        from    YourTable
        ) a
cross apply
        (
        select  top 1
                TransactionID
        from    YourTable
        where   CustomerID = a.CustomerID
        order by
                TransactionDate desc,
                TransactionId desc
        ) b

Andomar 2010-06-25 14:47:28

Should be `CROSS APPLY`; otherwise it won't parse. This is something I hadn't thought of; I just tested it and it seems to be on about the same order as the `ROW_NUMBER` solution. :( No sort, but an `Index Seek` appears in addition to the full scan (I'd expect the seek to be much faster, but it turns out to be 3 times as expensive as the scan). Still, +1 for coming up with something I hadn't considered.

Aaronaught 2010-06-25 14:52:39

@Aaronaught: Edited, and perhaps a DESC index is slightly faster. Note that this solution makes little sense without the index.

Andomar 2010-06-25 14:57:24

Actually, the index doesn't make a huge difference vs. the one I have - there are relatively few (but more than one) transactions per `CustomerID, TransactionDate` pair. A `DESC` index does improve things (it also improves the `ROW_NUMBER()` query); still, I'd prefer not to basically duplicate an existing index, and I'm pretty convinced that there's a way to do this in a single scan.

Aaronaught 2010-06-25 15:03:53
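
The `ROW_NUMBER` solution mentioned in these comments is not shown in this thread. A typical form of it, written here as an assumption against the same `YourTable` placeholder, would be something like:

```sql
-- Assumed shape of the ROW_NUMBER() approach; not quoted from the original question.
-- The alias "ranked" and the column name "rn" are made up for illustration.
select  CustomerID
,       TransactionID
from    (
        select  CustomerID
        ,       TransactionID
        ,       row_number() over (
                    partition by CustomerID
                    order by TransactionDate desc, TransactionID desc
                ) as rn
        from    YourTable
        ) ranked
where   rn = 1
```

Numbering every row in every partition is what tends to bring in the extra sort or scan cost discussed above, unless a matching index already supplies the order.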

Answer 2

+1 A:

Disclaimer: Thinking out loud :)

Could you have an indexed, computed column that combines the TransactionDate and TransactionID columns into a form that means finding the latest transaction is just a case of finding the MAX of that single field?
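
A minimal sketch of how that could look, assuming a `Transactions` table where `TransactionDate` is a `datetime` and `TransactionID` is a non-negative `int`. The column name `LatestKey` and the index name are made up for illustration.

```sql
-- Assumption: TransactionDate is datetime (8 bytes), TransactionID is a non-negative int (4 bytes).
-- LatestKey and IX_Transactions_CustomerID_LatestKey are hypothetical names.
ALTER TABLE Transactions
    ADD LatestKey AS (CAST(TransactionDate AS binary(8)) + CAST(TransactionID AS binary(4)));
GO  -- batch separator (SSMS/sqlcmd), so the new column is visible to the next statements

CREATE NONCLUSTERED INDEX IX_Transactions_CustomerID_LatestKey
    ON Transactions (CustomerID, LatestKey);
GO

-- The latest transaction per customer is then a MAX over the single combined column.
SELECT CustomerID, MAX(LatestKey) AS LatestKey
FROM Transactions
GROUP BY CustomerID;
```

The combined value still has to be split back into its date and ID parts before it is useful to a caller.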

AdaTheDev 2010-06-25 14:48:06

Even though this was a little light on implementation details, the *concept* of combining the fields was in fact what was needed to come up with an optimized solution. And since I hate self-accepting, I'll give you the check. ;)

Aaronaught 2010-07-09 14:10:29

Answer 3

+1 A:

How about something like this, where you force the optimizer to calculate the derived table first? In my tests, this was less expensive than the two Max comparisons.

Select T.CustomerId, T.TransactionDate, Max(T.TransactionId) As TransactionId
From Transactions As T
    Join    (
            Select T1.CustomerID, Max(T1.TransactionDate) As MaxDate
            From Transactions As T1
            Group By T1.CustomerId
            ) As Z
        On Z.CustomerId = T.CustomerId
            And Z.MaxDate = T.TransactionDate
Group By T.CustomerId, T.TransactionDate

Thomas 2010-06-25 15:45:04

Answer 4

A:

This one seemed to have good performance statistics:

SELECT
    T1.customer_id,
    MAX(T1.transaction_id) AS transaction_id
FROM
    dbo.Transactions T1
INNER JOIN
(
    SELECT
        T2.customer_id,
        MAX(T2.transaction_date) AS max_dt
    FROM
        dbo.Transactions T2
    GROUP BY
        T2.customer_id
) SQ1 ON
    SQ1.customer_id = T1.customer_id AND
    T1.transaction_date = SQ1.max_dt
GROUP BY
    T1.customer_id

Tom H. 2010-06-25 15:50:39

Answer 5

A:

I think I actually figured it out. @Ada had the right idea and I had the same idea myself, but was stuck on how to form a single composite ID and avoid the extra join.

Since the binary representations of both dates and (positive) integers sort in the same order as the values themselves, they can not only be concatenated into a single binary value for aggregation but also separated again after the aggregate is done.

This feels a little unholy, but it seems to do the trick:

SELECT
    CustomerID,
    CAST(SUBSTRING(MAX(
        CAST(TransactionDate AS binary(8)) + 
        CAST(TransactionID AS binary(4))),
      9, 4) AS int) AS TransactionID
FROM Transactions
GROUP BY CustomerID

That gives me a single index scan and stream aggregate. No additional indexes are needed either; it performs the same as just doing MAX(TransactionID), which makes sense, since all of the concatenation happens inside the aggregate itself.
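
The same trick can hand back the date as well as the ID by unpacking both halves of the aggregated value. The following is a sketch under the same assumptions (`TransactionDate` is a `datetime`, `TransactionID` is a non-negative `int`); the aliases `MaxKey` and `K` are made up.

```sql
-- Bytes 1-8 of the combined key hold the datetime, bytes 9-12 hold the int.
-- Caveat: the byte ordering only matches the value ordering for non-negative IDs
-- and for dates on or after 1900-01-01, because negative numbers set the high bit.
SELECT
    CustomerID,
    CAST(SUBSTRING(MaxKey, 1, 8) AS datetime) AS TransactionDate,
    CAST(SUBSTRING(MaxKey, 9, 4) AS int)      AS TransactionID
FROM (
    SELECT
        CustomerID,
        MAX(CAST(TransactionDate AS binary(8)) +
            CAST(TransactionID AS binary(4))) AS MaxKey
    FROM Transactions
    GROUP BY CustomerID
) AS K;
```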

Aaronaught 2010-06-25 16:04:50

Do you still get the same execution plan / performance if you encapsulate the unholiness 'out of sight' in a computed column?

AakashM 2010-06-25 16:59:40

@AakashM: Yes, that's actually what I've done. Indexing the computed column doesn't really improve performance, just makes it slightly easier to write queries (I still have to "unwrap" the values).

Aaronaught 2010-06-25 18:52:44
