views:

144

answers:

3

Please help me to generate the following query. Say I have customer table and order table.

Customer Table

CustID CustName

1      AA     
2      BB
3      CC
4      DD  

Order Table

OrderID  OrderDate          CustID
100      01-JAN-2000        1  
101      05-FEB-2000        1     
102      10-MAR-2000        1 
103      01-NOV-2000        2    
104      05-APR-2001        2 
105      07-MAR-2002        2
106      01-JUL-2003        1
107      01-SEP-2004        4
108      01-APR-2005        4
109      01-MAY-2006        3 
110      05-MAY-2007        1  
111      07-JUN-2007        1
112      06-JUL-2007        1 

I want to find out the customers who have made orders on three successive months. (Query using SQL server 2005 and 2008 is allowed).

The desired output is:

CustName      Year   OrderDate   

    AA        2000  01-JAN-2000       
    AA        2000  05-FEB-2000
    AA        2000  10-MAR-2000

    AA        2007  05-MAY-2007        
    AA        2007  07-JUN-2007        
    AA        2007  06-JUL-2007         
+6  A: 

Edit: Got rid or the MAX() OVER (PARTITION BY ...) as that seemed to kill performance.

;WITH cte AS ( 
SELECT    CustID  ,
          OrderDate,
          DATEPART(YEAR, OrderDate)*12 + DATEPART(MONTH, OrderDate) AS YM
 FROM     Orders
 ),
 cte1 AS ( 
SELECT    CustID  ,
          OrderDate,
          YM,
          YM - DENSE_RANK() OVER (PARTITION BY CustID ORDER BY YM) AS G
 FROM     cte
 ),
 cte2 As
 (
 SELECT CustID  ,
          MIN(OrderDate) AS Mn,
          MAX(OrderDate) AS Mx
 FROM cte1
GROUP BY CustID, G
HAVING MAX(YM)-MIN(YM) >=2 
 )
SELECT     c.CustName, o.OrderDate, YEAR(o.OrderDate) AS YEAR
FROM         Customers AS c INNER JOIN
                      Orders AS o ON c.CustID = o.CustID
INNER JOIN  cte2 c2 ON c2.CustID = o.CustID and o.OrderDate between Mn and Mx
order by c.CustName, o.OrderDate
Martin Smith
Needs to use DENSE_RANK, or four+ sales inside of three months will be overlooked.
OMG Ponies
Perfect grouped islands solution...
Emtucifor
Martin, I tested your query and it's not giving the right results...
Emtucifor
Martin, using my test data below, your query is returning pairs of orders that are two months apart, where mx-mn >= 2, but count(*) = 2 instead of >= 3.
Emtucifor
@Emtucifor - Think that should be fixed now. I needed to partition by `CustID, G` not just `G`. Thanks for letting me know!
Martin Smith
Martin, I love your solution for its elegance, though to my surprise my "interesting curiosity" solution actually performs better. I think there's a lesson here, which is that the ranking functions do have a sometimes very significant cost! And also, that somewhat contrary to expectations CROSS JOINs can be very effective ways to solve grouping/ranking problems. Now if we had LAG and LEAD windowing functions, perhaps they'd win over everything!
Emtucifor
@Emtucifor The `MAX()/MIN() OVER (PARTITION BY ...)` seemed to be the main culprit. Getting rid of that improves things somewhat.
Martin Smith
It improves things a bit more than somewhat! Now your query is the clear winner.
Emtucifor
+2  A: 

Here you go:

select distinct
 CustName
,year(OrderDate) [Year]
,OrderDate
from 
(
select 
 o2.OrderDate [prev]
,o1.OrderDate [curr]
,o3.OrderDate [next]
,c.CustName
from [order] o1 
join [order] o2 on o1.CustId = o2.CustId and datediff(mm, o2.OrderDate, o1.OrderDate) = 1
join [order] o3 on o1.CustId = o3.CustId and o2.OrderId <> o3.OrderId and datediff(mm, o3.OrderDate, o1.OrderDate) = -1
join Customer c on c.CustId = o1.CustId
) t
unpivot
(
    OrderDate for [DateName] in ([prev], [curr], [next])
)
unpvt
order by CustName, OrderDate
Denis Valeev
Warning: This query is extremely inefficient. :)
Denis Valeev
Denis, I'm sorry to report that this query doesn't return the correct results when there are two orders by the same customer on the same day.
Emtucifor
@Emtucifor, I know! But we don't know what @CSharpy needs! :)
Denis Valeev
+3  A: 

Here is my version. I really was presenting this as a mere curiosity, to show another way of thinking about the problem. It turned out to be more useful than that because it performed better than even Martin Smith's cool "grouped islands" solution. Though, once he got rid of some overly expensive aggregate windowing functions and did real aggregates instead, his query started kicking butt.

Solution 1: Runs of 3 months or more, done by checking 1 month ahead and behind and using a semi-join against that.

WITH Months AS (
   SELECT DISTINCT
      O.CustID,
      Grp = DateDiff(Month, '20000101', O.OrderDate)
   FROM
      CustOrder O
), Anchors AS (
   SELECT
      M.CustID,
      Ind = M.Grp + X.Offset
   FROM
      Months M
      CROSS JOIN (
         SELECT -1 UNION ALL SELECT 0 UNION ALL SELECT 1
      ) X (Offset)
   GROUP BY
      M.CustID,
      M.Grp + X.Offset
   HAVING
      Count(*) = 3
)
SELECT
   C.CustName,
   [Year] = Year(OrderDate),
   O.OrderDate
FROM
   Cust C
   INNER JOIN CustOrder O ON C.CustID = O.CustID
WHERE
   EXISTS (
      SELECT 1
      FROM
         Anchors A
      WHERE
         O.CustID = A.CustID
         AND O.OrderDate >= DateAdd(Month, A.Ind, '19991201')
         AND O.OrderDate < DateAdd(Month, A.Ind, '20000301')
   )
ORDER BY
   C.CustName,
   OrderDate;

Solution 2: Exact 3-month patterns. If it is a 4-month or greater run, the values are excluded. This is done by checking 2 months ahead and two months behind (essentially looking for the pattern N, Y, Y, Y, N).

WITH Months AS (
   SELECT DISTINCT
      O.CustID,
      Grp = DateDiff(Month, '20000101', O.OrderDate)
   FROM
      CustOrder O
), Anchors AS (
   SELECT
      M.CustID,
      Ind = M.Grp + X.Offset
   FROM
      Months M
      CROSS JOIN (
         SELECT -2 UNION ALL SELECT -1 UNION ALL SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2
      ) X (Offset)
   GROUP BY
      M.CustID,
      M.Grp + X.Offset
   HAVING
      Count(*) = 3
      AND Min(X.Offset) = -1
      AND Max(X.Offset) = 1
)
SELECT
   C.CustName,
   [Year] = Year(OrderDate),
   O.OrderDate
FROM
   Cust C
   INNER JOIN CustOrder O ON C.CustID = O.CustID
   INNER JOIN Anchors A
      ON O.CustID = A.CustID
      AND O.OrderDate >= DateAdd(Month, A.Ind, '19991201')
      AND O.OrderDate < DateAdd(Month, A.Ind, '20000301')
ORDER BY
   C.CustName,
   OrderDate;

Here's my table-loading script if anyone else wants to play:

IF Object_ID('CustOrder', 'U') IS NOT NULL DROP TABLE CustOrder
IF Object_ID('Cust', 'U') IS NOT NULL DROP TABLE Cust
GO
SET NOCOUNT ON
CREATE TABLE Cust (
  CustID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
  CustName varchar(100) UNIQUE
)

CREATE TABLE CustOrder (
   OrderID int identity(100, 1) NOT NULL PRIMARY KEY CLUSTERED,
   CustID int NOT NULL FOREIGN KEY REFERENCES Cust (CustID),
   OrderDate smalldatetime NOT NULL
)

DECLARE @i int
SET @i = 1000
WHILE @i > 0 BEGIN
   WITH N AS (
      SELECT
         Nm =
            Char(Abs(Checksum(NewID())) % 26 + 65)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
   )
   INSERT Cust
   SELECT N.Nm
   FROM N
   WHERE NOT EXISTS (
      SELECT 1
      FROM Cust C
      WHERE
         N.Nm = C.CustName
   )

   SET @i = @i - @@RowCount
END
WHILE @i < 50000 BEGIN
   INSERT CustOrder
   SELECT TOP (50000 - @i)
      Abs(Checksum(NewID())) % 1000 + 1,
      DateAdd(Day, Abs(Checksum(NewID())) % 10000, '19900101')
   FROM master.dbo.spt_values
   SET @i = @i + @@RowCount
END

Performance

Here are some performance testing results for the 3-month-or-more queries:

Query     CPU   Reads Duration
Martin 1  2297 299412   2348 
Martin 2   625    285    809
Denis     3641    401   3855
Emtucifor 1855  94727   2077

This is only one run of each, but the numbers are fairly representative. It turns out that your query wasn't so badly-performing, Denis, after all. Martin's query beats the others hands down, but at first was using some overly-expensive windowing functions strategies that he fixed.

Of course, as I noted, Denis's query isn't pulling the right rows when a customer has two orders on the same day, so his query is out of contention unless he fixed is.

Also, different indexes could possibly shake things up. I don't know.

Emtucifor
Don't make me add two more joins to my solution, it's already three dimensional. :P
Denis Valeev
You need to update your performance chart!
Martin Smith
Done. I left the stats on your older version in just to show that not all windowing function operations are so great. Used indiscriminately they can hurt performance.
Emtucifor