views:

390

answers:

6

Hi,

I have a question related to T-SQL and SQL Server.

Let's say I have a table Orders with 2 columns:

  • ProductId int
  • CustomerId int
  • Date datetime

I want the date of the first order for every product, so I perform this type of query:

SELECT ProductId, MIN(Date) AS FirstOrder FROM Orders GROUP BY ProductId

I have an index on ProductId, including the columns CustomerId and Date to speed up the query (IX_Orders). The query plan looks like a non-clustered index scan on IX_Orders, followed by a stream aggregate (no sort thanks to the index).

Now my problem is that I also want to retrieve the CustomerId associated with the first order for each product (Product 26 was first ordered on Tuesday 25, by customer 12). The tricky part is that I don't want any inner loop in the execution plan, because it would mean on additional read per ProductId in the table, which is highly inefficient.

This should be just possible using the same non-clustered index scan, followed by stream aggregates, however I can't seem to find a query that would do that. Any idea?

Thanks

A: 
SELECT
    o1.productid, 
    o1.date, 
    o1.customerid
FROM
    Orders o1
JOIN
    (select productid, min(date) as orderDate
     from Orders
     group by productid
    ) firstOrder
ON o1.productid = firstOrder.productid

This is the best I can come up with though to be honest, I don't know what the performance characteristics of this query are. If it's no good, I'd probably suggest running two queries to get the information you want.

Josh Smeaton
+1: I think you need to define an alias for min(date) inside your anonymous join; otherwise, this is exactly what I got. Would be good to know if there is a better approach to this.
butterchicken
This query gets the wrong answer as it does not include Date in the join between o1 and firstOrder
Martin Brown
You don't need to include the Date in the join. You only need the product id of the sub-query, since that has the minimum date associated with it. The select returning the actual results returns the date.
Josh Smeaton
+1  A: 
declare @Orders table (
    ProductId int,
    CustomerId int,
    Date datetime
)

insert into @Orders values (1,1,'20090701')
insert into @Orders values (2,1,'20090703')
insert into @Orders values (3,1,'20090702')
insert into @Orders values (1,2,'20090704')
insert into @Orders values (4,2,'20090701')
insert into @Orders values (1,3,'20090706')
insert into @Orders values (2,3,'20090704')
insert into @Orders values (4,3,'20090702')
insert into @Orders values (5,5,'20090703')

select O.* from @Orders O inner join 
(
    select ProductId,
    MIN(Date) MinDate 
    from @Orders 
    group by ProductId
) FO
on FO.ProductId = O.ProductId and FO.MinDate = O.Date

The estimated query plan for this is useless as I'm mocking it with table variables, but the anonymous inner join should be optimised over a subselect.

butterchicken
Your select needs to include FO.MinDate.
pjp
I've never heard of an "anonymous" join, I've always used the term derived table.
KM
this will not work if there are multiple rows with the same min date for the same product. Try it out, add this code to the example: _insert into @Orders values (5,1,'20090703'); insert into @Orders values (5,5,'20090703')_ you'll get product 5 multiple times in the result set.
KM
But isn't that valid? You want the customer ID of the first order for any product, but given the data, if multiple customers order the same product on the same day, I'd say you should get them all back. If a customer orders the same product twice, then perhaps this shouldn't happen - you could make the O.* `distinct` in this case.
butterchicken
OP says "I want the date of the first order for every product", to me that means only list the product one time. The OP also wants the the customer that ordered it. If the OP is using DATETIME columns with actual times (not just days like in your test data) you can probably get the first one, since two orders at the same time is low probability, but can still possibly happen. I would think that if there is a tie, same product ordered at the same time, OP still wants only one row in the result set, but that is my take on it.
KM
I think the point is that the OP wants **more** than the date of the first order for every product: "I also want to retrieve the CustomerId associated with the first order for each product". In which case the OP needs to determine _how_ they will choose a single answer when two are equally valid; it could be as simple as the one with lowest CustomerId.
butterchicken
only the OP can answer what they are after. However, they didn't say they want the first date for each customer. In my solution I only return one row per product and I break ties by using the lowest CustomerId.
KM
A: 

Is IX_Orders sorted by ProductId, then CutomerId, then Date or is it ProductId, then Date, then CustomerId? If it is the former change it to the latter.

In other words don't use this:

create index IX_Orders on Orders (ProductId, CustomerId, Date)

Use this instead:

create index IX_Orders on Orders (ProductId, Date, CustomerId)

Then if you do:

SELECT o1.* 
FROM [Order] o1
JOIN
    (
     SELECT ProductID, Min(Date) as Date
     FROM [Order]
     GROUP BY ProductID
    ) o2
    ON o1.ProductID = o2.ProductID AND o1.Date = o2.Date
ORDER BY ProductID

You end up with just one index scan on IX_Orders however if two customers can order the same product at the same time you could get multiple rows for each product. You can get past this by using the following quiery, but it is less efficient than the first:

WITH cte AS
(
    SELECT ProductID, CustomerID, Date, 
     ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY Date ASC) AS row
    FROM [Order]
)
SELECT ProductID, CustomerId, Date
FROM cte
WHERE row = 1
ORDER BY ProductID
Martin Brown
+2  A: 

this will handle products that have duplicate dates:

DECLARE @Orders table (ProductId int
                      ,CustomerId int
                      ,Date datetime
                      )

INSERT INTO @Orders VALUES (1,1,'20090701')
INSERT INTO @Orders VALUES (2,1,'20090703')
INSERT INTO @Orders VALUES (3,1,'20090702')
INSERT INTO @Orders VALUES (1,2,'20090704')
INSERT INTO @Orders VALUES (4,2,'20090701')
INSERT INTO @Orders VALUES (1,3,'20090706')
INSERT INTO @Orders VALUES (2,3,'20090704')
INSERT INTO @Orders VALUES (4,3,'20090702')
INSERT INTO @Orders VALUES (5,5,'20090703')  --duplicate dates for product #5
INSERT INTO @Orders VALUES (5,1,'20090703')  --duplicate dates for product #5
INSERT INTO @Orders VALUES (5,5,'20090703')  --duplicate dates for product #5

;WITH MinOrders AS
(SELECT
     o.ProductId, o.CustomerId, o.Date
         ,row_number() over(partition by o.ProductId order by o.ProductId,o.CustomerId) AS RankValue
     FROM @Orders o
     INNER JOIN (SELECT
                     ProductId
                         ,MIN(Date) MinDate 
                     FROM @Orders 
                     GROUP BY ProductId
                ) dt ON o.ProductId=dt.ProductId AND o.Date=dt.MinDate
 )
SELECT
    m.ProductId, m.CustomerId, m.Date
    FROM MinOrders  m
    WHERE m.RankValue=1
    ORDER BY m.ProductId, m.CustomerId

this will return the same results, just use the same declare and inserts as the above code:

;WITH MinOrders AS
(SELECT
     o.ProductId, o.CustomerId, o.Date
         ,row_number() over(partition by o.ProductId order by o.ProductId,o.CustomerId) AS RankValue
     FROM @Orders o
 )
SELECT
    m.ProductId, m.CustomerId, m.Date
    FROM MinOrders  m
    WHERE m.RankValue=1
    ORDER BY m.ProductId, m.CustomerId

You can try out each version to see which will run faster...

KM
Good, there is only one index scan, but it gives a sort in the query execution plan though.
A: 

I do not see a way of doing this nicely without doing a subquery or a windowing function (such as row_number, rank) as the max only looks in one column.

However you can do it not nicely.

SELECT
    productid, 
    min(date), 
cast(
    substring( 
        min(convert(varchar(23),date,21) + cast(customerid as varchar(20)))
              , 24, 44)
    as int) customerid
from 
    orders
group by
    productid

This only works if your customer id has less then 20 digits.

EDIT: group by clause added

David Raznick
Msg 8120, Level 16, State 1, Line 51Column '@orders.ProductId' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
KM
oops, forgot to add the group by clause
David Raznick
+1  A: 

In SQL Server 2005+:

SELECT  oo.*
FROM    (
        SELECT  DISTINCT ProductId
        FROM    Orders
        ) od
CROSS APPLY
        (
        SELECT  TOP 1 ProductID, Date, CustomerID
        FROM    Orders oi
        WHERE   oi.ProductID = od.ProductID
        ORDER BY
                Date DESC
        ) oo

Nominally, the plan for the query contains Nested Loops.

However, the outer loop will use a Index Scan with Stream Aggregate, and the inner loop will contain an Index Seek for the ProductID with a Top.

In fact, the second operation is almost free, since the index page used in the inner loop will most probably reside in the cache because it had just been used for the outer loop.

Here's the test result on 1,000,000 rows (with 100 DISTINCT ProductID's):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.

(строк обработано: 100)
Table 'Orders'. Scan count 103, logical reads 6020, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 234 ms,  elapsed time = 125 ms.

, while this is a result of a mere SELECT DISTINCT query:

SELECT  od.*
FROM    (
        SELECT  DISTINCT ProductId
        FROM    Orders
        ) od

And the statistics:

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.

(строк обработано: 100)
Table 'Orders'. Scan count 3, logical reads 5648, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 250 ms,  elapsed time = 125 ms.

As we can see, the performance is the same, and the CROSS APPLY takes just 400 extra logical reads (which most probably will never be physical).

Don't see how it's possible to improve this query anymore.

Also the benefit of this query is that it parallelizes nicely. You may notice that CPU time is twice the elapsed time: it's due to parallelization on my old Core Duo.

A 4-core CPU would complete this query in half of that time.

Solutions using window functions do not parallelize:

SELECT  od.*
FROM    (
        SELECT  ProductId, Date, CustomerID, ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Date DESC) AS rn
        FROM    Orders
        ) od
WHERE   rn = 1

, and here are the statistics:

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms.

(строк обработано: 100)
Table 'Orders'. Scan count 1, logical reads 5123, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 406 ms,  elapsed time = 415 ms.
Quassnoi
Need to change where to: WHERE oi.ProductID = od.ProductID. My query plan, showed a sort and nested loop.
Shannon Severance
I get _Msg 4104, Level 16, State 1, Line 2 The multi-part identifier "oo.ProductID" could not be bound._ to fix change WHERE as @Shannon Severance suggests.
KM
Right, forgot to fix it. `@Shannon`: did you create the index as the `@op` said? `CREATE INDEX IX_orders_pdc ON Orders (ProductID, Date, CustomerID)`
Quassnoi
My index was (ProductId, Date, CustomerId) I will check with (ProductId, Date, CustomerId)
Shannon Severance
@Quassnoi: Moving the columns in the index, I did lose the sort. Still the loop with a seek, as you point out in your answer.
Shannon Severance