ansaurus

Question

TSQL Finding Order that occurred in 3 consecutive months

Answer 1

+6 A:

Edit: Got rid or the MAX() OVER (PARTITION BY ...) as that seemed to kill performance.

;WITH cte AS ( 
SELECT    CustID  ,
          OrderDate,
          DATEPART(YEAR, OrderDate)*12 + DATEPART(MONTH, OrderDate) AS YM
 FROM     Orders
 ),
 cte1 AS ( 
SELECT    CustID  ,
          OrderDate,
          YM,
          YM - DENSE_RANK() OVER (PARTITION BY CustID ORDER BY YM) AS G
 FROM     cte
 ),
 cte2 As
 (
 SELECT CustID  ,
          MIN(OrderDate) AS Mn,
          MAX(OrderDate) AS Mx
 FROM cte1
GROUP BY CustID, G
HAVING MAX(YM)-MIN(YM) >=2 
 )
SELECT     c.CustName, o.OrderDate, YEAR(o.OrderDate) AS YEAR
FROM         Customers AS c INNER JOIN
                      Orders AS o ON c.CustID = o.CustID
INNER JOIN  cte2 c2 ON c2.CustID = o.CustID and o.OrderDate between Mn and Mx
order by c.CustName, o.OrderDate

Martin Smith 2010-09-18 21:49:32

Needs to use DENSE_RANK, or four+ sales inside of three months will be overlooked.

OMG Ponies 2010-09-18 22:09:55

Perfect grouped islands solution...

Emtucifor 2010-09-20 16:20:01

Martin, I tested your query and it's not giving the right results...

Emtucifor 2010-09-20 20:05:11

Martin, using my test data below, your query is returning pairs of orders that are two months apart, where mx-mn >= 2, but count(*) = 2 instead of >= 3.

Emtucifor 2010-09-20 20:33:03

@Emtucifor - Think that should be fixed now. I needed to partition by `CustID, G` not just `G`. Thanks for letting me know!

Martin Smith 2010-09-20 20:59:22

Martin, I love your solution for its elegance, though to my surprise my "interesting curiosity" solution actually performs better. I think there's a lesson here, which is that the ranking functions do have a sometimes very significant cost! And also, that somewhat contrary to expectations CROSS JOINs can be very effective ways to solve grouping/ranking problems. Now if we had LAG and LEAD windowing functions, perhaps they'd win over everything!

Emtucifor 2010-09-20 22:26:18

@Emtucifor The `MAX()/MIN() OVER (PARTITION BY ...)` seemed to be the main culprit. Getting rid of that improves things somewhat.

Martin Smith 2010-09-21 00:00:06

It improves things a bit more than somewhat! Now your query is the clear winner.

Emtucifor 2010-09-21 00:18:48

Answer 2

+2 A:

Here you go:

select distinct
 CustName
,year(OrderDate) [Year]
,OrderDate
from 
(
select 
 o2.OrderDate [prev]
,o1.OrderDate [curr]
,o3.OrderDate [next]
,c.CustName
from [order] o1 
join [order] o2 on o1.CustId = o2.CustId and datediff(mm, o2.OrderDate, o1.OrderDate) = 1
join [order] o3 on o1.CustId = o3.CustId and o2.OrderId <> o3.OrderId and datediff(mm, o3.OrderDate, o1.OrderDate) = -1
join Customer c on c.CustId = o1.CustId
) t
unpivot
(
    OrderDate for [DateName] in ([prev], [curr], [next])
)
unpvt
order by CustName, OrderDate

Denis Valeev 2010-09-18 21:59:52

Warning: This query is extremely inefficient. :)

Denis Valeev 2010-09-18 22:58:40

Denis, I'm sorry to report that this query doesn't return the correct results when there are two orders by the same customer on the same day.

Emtucifor 2010-09-20 22:09:16

@Emtucifor, I know! But we don't know what @CSharpy needs! :)

Denis Valeev 2010-09-21 06:37:28

Answer 3

+3 A:

Here is my version. I really was presenting this as a mere curiosity, to show another way of thinking about the problem. It turned out to be more useful than that because it performed better than even Martin Smith's cool "grouped islands" solution. Though, once he got rid of some overly expensive aggregate windowing functions and did real aggregates instead, his query started kicking butt.

Solution 1: Runs of 3 months or more, done by checking 1 month ahead and behind and using a semi-join against that.

WITH Months AS (
   SELECT DISTINCT
      O.CustID,
      Grp = DateDiff(Month, '20000101', O.OrderDate)
   FROM
      CustOrder O
), Anchors AS (
   SELECT
      M.CustID,
      Ind = M.Grp + X.Offset
   FROM
      Months M
      CROSS JOIN (
         SELECT -1 UNION ALL SELECT 0 UNION ALL SELECT 1
      ) X (Offset)
   GROUP BY
      M.CustID,
      M.Grp + X.Offset
   HAVING
      Count(*) = 3
)
SELECT
   C.CustName,
   [Year] = Year(OrderDate),
   O.OrderDate
FROM
   Cust C
   INNER JOIN CustOrder O ON C.CustID = O.CustID
WHERE
   EXISTS (
      SELECT 1
      FROM
         Anchors A
      WHERE
         O.CustID = A.CustID
         AND O.OrderDate >= DateAdd(Month, A.Ind, '19991201')
         AND O.OrderDate < DateAdd(Month, A.Ind, '20000301')
   )
ORDER BY
   C.CustName,
   OrderDate;

Solution 2: Exact 3-month patterns. If it is a 4-month or greater run, the values are excluded. This is done by checking 2 months ahead and two months behind (essentially looking for the pattern N, Y, Y, Y, N).

WITH Months AS (
   SELECT DISTINCT
      O.CustID,
      Grp = DateDiff(Month, '20000101', O.OrderDate)
   FROM
      CustOrder O
), Anchors AS (
   SELECT
      M.CustID,
      Ind = M.Grp + X.Offset
   FROM
      Months M
      CROSS JOIN (
         SELECT -2 UNION ALL SELECT -1 UNION ALL SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2
      ) X (Offset)
   GROUP BY
      M.CustID,
      M.Grp + X.Offset
   HAVING
      Count(*) = 3
      AND Min(X.Offset) = -1
      AND Max(X.Offset) = 1
)
SELECT
   C.CustName,
   [Year] = Year(OrderDate),
   O.OrderDate
FROM
   Cust C
   INNER JOIN CustOrder O ON C.CustID = O.CustID
   INNER JOIN Anchors A
      ON O.CustID = A.CustID
      AND O.OrderDate >= DateAdd(Month, A.Ind, '19991201')
      AND O.OrderDate < DateAdd(Month, A.Ind, '20000301')
ORDER BY
   C.CustName,
   OrderDate;

Here's my table-loading script if anyone else wants to play:

IF Object_ID('CustOrder', 'U') IS NOT NULL DROP TABLE CustOrder
IF Object_ID('Cust', 'U') IS NOT NULL DROP TABLE Cust
GO
SET NOCOUNT ON
CREATE TABLE Cust (
  CustID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
  CustName varchar(100) UNIQUE
)

CREATE TABLE CustOrder (
   OrderID int identity(100, 1) NOT NULL PRIMARY KEY CLUSTERED,
   CustID int NOT NULL FOREIGN KEY REFERENCES Cust (CustID),
   OrderDate smalldatetime NOT NULL
)

DECLARE @i int
SET @i = 1000
WHILE @i > 0 BEGIN
   WITH N AS (
      SELECT
         Nm =
            Char(Abs(Checksum(NewID())) % 26 + 65)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
            + Char(Abs(Checksum(NewID())) % 26 + 97)
   )
   INSERT Cust
   SELECT N.Nm
   FROM N
   WHERE NOT EXISTS (
      SELECT 1
      FROM Cust C
      WHERE
         N.Nm = C.CustName
   )

   SET @i = @i - @@RowCount
END
WHILE @i < 50000 BEGIN
   INSERT CustOrder
   SELECT TOP (50000 - @i)
      Abs(Checksum(NewID())) % 1000 + 1,
      DateAdd(Day, Abs(Checksum(NewID())) % 10000, '19900101')
   FROM master.dbo.spt_values
   SET @i = @i + @@RowCount
END

Performance

Here are some performance testing results for the 3-month-or-more queries:

Query     CPU   Reads Duration
Martin 1  2297 299412   2348 
Martin 2   625    285    809
Denis     3641    401   3855
Emtucifor 1855  94727   2077

This is only one run of each, but the numbers are fairly representative. It turns out that your query wasn't so badly-performing, Denis, after all. Martin's query beats the others hands down, but at first was using some overly-expensive windowing functions strategies that he fixed.

Of course, as I noted, Denis's query isn't pulling the right rows when a customer has two orders on the same day, so his query is out of contention unless he fixed is.

Also, different indexes could possibly shake things up. I don't know.

Emtucifor 2010-09-20 20:13:22

Don't make me add two more joins to my solution, it's already three dimensional. :P

Denis Valeev 2010-09-20 21:22:57

You need to update your performance chart!

Martin Smith 2010-09-20 23:57:25

Done. I left the stats on your older version in just to show that not all windowing function operations are so great. Used indiscriminately they can hurt performance.

Emtucifor 2010-09-21 00:19:40

ansaurus

tags:

views:

answers:

TSQL Finding Order that occurred in 3 consecutive months

related questions