I have a table schema similar to the following (simplified):
CREATE TABLE Transactions
(
    TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    CustomerID int NOT NULL, -- Foreign key, not shown
    TransactionDate datetime NOT NULL,
    ...
)

CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)
To give a bit of background, this transaction table consolidates several different types of transactions from another vendor's database (we'll call it an ETL process), so I don't have a great deal of control over the order in which they get inserted. Even if I did, transactions may be backdated, so the important thing to note here is that the maximum TransactionID for any given customer is not necessarily the most recent transaction.
In fact, the most recent transaction is a combination of the date and the ID. Dates are not unique - the vendor often truncates the time of day - so to get the most recent transaction, I have to first find the most recent date, and then find the most recent ID for that date.
I know that I can do this with a windowing query (ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TransactionDate DESC, TransactionID DESC)), but this requires a full index scan and a very expensive sort, and thus fails miserably in terms of efficiency. It's also pretty awkward to keep writing all the time.
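For reference, a minimal sketch of the windowing approach, assuming the rows are partitioned by CustomerID and ordered by date and then ID. The CTE name Ranked and the column alias rn are illustrative only and not part of the actual schema:

```sql
-- Sketch only: Ranked and rn are hypothetical names chosen for illustration.
WITH Ranked AS
(
    SELECT
        CustomerID,
        TransactionID,
        -- Number each customer's rows from newest to oldest:
        -- latest date first, highest ID breaking ties within a date.
        ROW_NUMBER() OVER
        (
            PARTITION BY CustomerID
            ORDER BY TransactionDate DESC, TransactionID DESC
        ) AS rn
    FROM Transactions
)
SELECT CustomerID, TransactionID
FROM Ranked
WHERE rn = 1;  -- keep only the newest row per customer
```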
Slightly more efficient is using two CTEs or nested subqueries, one to find the MAX(TransactionDate) per CustomerID, and another to find the MAX(TransactionID) for that date. Again, it works, but it requires a second aggregate and a join, which is slightly better than the ROW_NUMBER() query but still rather painful performance-wise.
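A rough sketch of the two-step version, assuming a single CTE for the per-customer maximum date joined back to the table. The name LatestDates and the aliases t and d are hypothetical:

```sql
-- Sketch only: LatestDates, t and d are hypothetical names chosen for illustration.
WITH LatestDates AS
(
    -- First aggregate: the most recent date per customer.
    SELECT CustomerID, MAX(TransactionDate) AS TransactionDate
    FROM Transactions
    GROUP BY CustomerID
)
-- Second aggregate: the highest ID among the rows on that date.
SELECT t.CustomerID, MAX(t.TransactionID) AS TransactionID
FROM Transactions AS t
INNER JOIN LatestDates AS d
    ON d.CustomerID = t.CustomerID
    AND d.TransactionDate = t.TransactionDate
GROUP BY t.CustomerID;
```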
I've also considered using a CLR User-Defined Aggregate and will fall back on that if necessary, but I'd prefer to find a pure SQL solution if possible to simplify the deployment (there's no need for SQL-CLR anywhere else in this project).
So the question, specifically, is:
Is it possible to write a query that will return the newest TransactionID per CustomerID, defined as the maximum TransactionID for the most recent TransactionDate, and achieve a plan equivalent in performance to an ordinary MAX/GROUP BY query?
(In other words, the only significant steps in the plan should be an index scan and stream aggregate. Multiple scans, sorts, joins, etc. are likely to be too slow.)