views:

402

answers:

2

Is there a better way of merging overlapping date intervals?
The solution I came up with is so simple that now I wonder if someone else has a better idea of how this could be done.

/***** DATA EXAMPLE *****/
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
        SELECT '2010-01-01','2010-03-31' UNION SELECT '2010-04-01','2010-05-31' 
  UNION SELECT '2010-06-15','2010-06-25' UNION SELECT '2010-06-26','2010-07-10' 
  UNION SELECT '2010-08-01','2010-08-05' UNION SELECT '2010-08-01','2010-08-09' 
  UNION SELECT '2010-08-02','2010-08-07' UNION SELECT '2010-08-08','2010-08-08' 
  UNION SELECT '2010-08-09','2010-08-12' UNION SELECT '2010-07-04','2010-08-16' 
  UNION SELECT '2010-11-01','2010-12-31' UNION SELECT '2010-03-01','2010-06-13' 

/***** INTERVAL ANALYSIS *****/
WHILE (1=1)  BEGIN
  UPDATE t1 SET t1.d2 = t2.d2
  FROM @T AS t1 INNER JOIN @T AS t2 ON 
            DATEADD(day, 1, t1.d2) BETWEEN t2.d1 AND t2.d2 
  IF @@ROWCOUNT = 0 BREAK
END

/***** RESULT *****/
SELECT StartDate = MIN(d1) , EndDate = d2
FROM @T
GROUP BY d2
ORDER BY StartDate, EndDate

/***** OUTPUT *****/
/*****
StartDate   EndDate
2010-01-01  2010-06-13 
2010-06-15  2010-08-16 
2010-11-01  2010-12-31 
*****/
A: 

In this solution, I created a temporary Calendar table which stores a value for every day across a range. This type of table can be made static. In addition, I'm only storing 400 some odd dates starting with 2009-12-31. Obviously, if your dates span a larger range, you would need more values.

In addition, this solution will only work with SQL Server 2005+ in that I'm using a CTE.

With Calendar As
    (
    Select DateAdd(d, ROW_NUMBER() OVER ( ORDER BY s1.object_id ), '1900-01-01') As [Date]
    From sys.columns as s1
        Cross Join sys.columns as s2
    )
    , StopDates As
    (
    Select C.[Date]
    From Calendar As C
        Left Join @T As T
            On C.[Date] Between T.d1 And T.d2
    Where C.[Date] >= ( Select Min(T2.d1) From @T As T2 )
        And C.[Date] <= ( Select Max(T2.d2) From @T As T2 )
        And T.d1 Is Null
    )
    , StopDatesInUse As
    (
    Select D1.[Date]
    From StopDates As D1
        Left Join StopDates As D2
            On D1.[Date] = DateAdd(d,1,D2.Date)
    Where D2.[Date] Is Null
    )
    , DataWithEariestStopDate As 
    (
    Select *
    , (Select Min(SD2.[Date])
        From StopDatesInUse As SD2
        Where T.d2 < SD2.[Date] ) As StopDate
    From @T As T
    )
Select Min(d1), Max(d2)
From DataWithEariestStopDate
Group By StopDate
Order By Min(d1)

EDIT The problem with using dates in 2009 has nothing to do with the final query. The problem is that the Calendar table is not big enough. I started the Calendar table at 2009-12-31. I have revised it start at 1900-01-01.

Thomas
Your code is merging intervals that are not supposed to be merged. Using this initial intervals /**/ SELECT '2009-01-01', '2009-01-01' UNION SELECT '2009-01-03', '2009-01-03' /**/ the code returns a single period : 2009-01-01 to 2009-01-03. In this case 2009-01-02 should not be included in the resulting interval.
leoinfo
First, you should add the schema and specifically whether D1 = D2. None of your example data suggests that. Second, if you **add** {2010-01-01,2010-01-01}, to your existing example data, the first range should still be 2010-01-01 to 2010-06-13 because the first entry in your example covers 2010-01-01 to 2010-03-31. Third, if you instead **replace** the first entry in your example with {2010-01-01, 2010-01-01},{2010-03-01, 2010-03-01}, the results of my query are still correct. Making that change, the first two entries come out as {2010-01-01, 2010-01-01}, {2010-03-01, 2010-06-13}.
Thomas
One more scenario, if you replace all entries with only {2010-01-01,2010-01-01},{2010-03-01,2010-03-01}, you will get those same two entries.
Thomas
I see a typo in my edit but the results are still correct. If you replace the first entry with {2010-01-01,2010-01-01} and {2010-01-03,2010-01-03}, it does come out correctly in that it creates two ranges which exclude 2010-01-02.
Thomas
@leoinfo - Added this comment to ensure that you are notified.
Thomas
@leoinfo - My apologies. I notice now that your breaking example is in 2009. The problem is that in my original example, my Calendar table starts 2009-12-31. All that needs happen is to expand the Calendar table. I have adjusted.
Thomas
@Thomas - Your single-query code returns the right results. I run some tests though and the only issue now is that the code is very slow for more (like 500+) records, regardless of using temp table or variable table. On my test, merging 500 overlapping intervals it took about 42s/32s (Temp/Variable). Oddly is that merging 500 NON-overlapping intervals took even longer, 65s/161s.
leoinfo
@Thomas - With my solution, merging 500 overlapping intervals took about 37s/1s (Temp/Variable). Merging 500 NON-overlapping intervals took 150ms/56ms.
leoinfo
@Thomas - I wander if your code could be further optimized, because it really looks better than mine. Unfortunately, in this form, it seems to be too resource-consuming.
leoinfo
@leoinfo - Certainly one way to help is to make the Calendar table static. I.e., create the Calendar table outside of the query with a clustered PK on the date column.
Thomas
@leoinfo - Have you tried using a temp table and creating indexes on d1 and d2? I'm going to guess that the most expensive part of the query is the StopDatesInUse CTE section because of DateAdd which it may not be interpreting as SARGable. You might also consider enabling `DATE_CORRELATION_OPTIMIZATION` (see http://technet.microsoft.com/en-us/library/ms177416.aspx) which will help significantly with queries against the Calendar table.
Thomas
A: 

Try this

;WITH T1 AS
(
    SELECT d1, d2, ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
    FROM @T
), NUMS AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
    FROM T1 A
    CROSS JOIN T1 B
    CROSS JOIN T1 C
), ONERANGE AS 
(
    SELECT DISTINCT DATEADD(DAY, ROW_NUMBER() OVER(PARTITION BY T1.R ORDER BY (SELECT 0)) - 1, T1.D1) AS ELEMENT
    FROM T1
    CROSS JOIN NUMS
    WHERE NUMS.R <= DATEDIFF(DAY, d1, d2) + 1
), SEQUENCE AS
(
    SELECT ELEMENT, DATEDIFF(DAY, '19000101', ELEMENT) - ROW_NUMBER() OVER(ORDER BY ELEMENT) AS rownum
    FROM ONERANGE
)
SELECT MIN(ELEMENT) AS StartDate, MAX(ELEMENT) as EndDate
FROM SEQUENCE
GROUP BY rownum

The basic idea is to first unroll the existing data, so you get a separate row for each day. This is done in ONERANGE

Then, identify the relationship between how dates increment and the way the row numbers do. The difference remains constant within an existing range/island. As soon as you get to a new data island, the difference between them increases because the date increments by more than 1, while the row number increments by 1.

Chris Bednarski