ansaurus

Question

Answer 1

A:

In this solution, I created a temporary Calendar table which stores a value for every day across a range. This type of table can be made static. In addition, I'm only storing 400 some odd dates starting with 2009-12-31. Obviously, if your dates span a larger range, you would need more values.

In addition, this solution will only work with SQL Server 2005+ in that I'm using a CTE.

With Calendar As
    (
    Select DateAdd(d, ROW_NUMBER() OVER ( ORDER BY s1.object_id ), '1900-01-01') As [Date]
    From sys.columns as s1
        Cross Join sys.columns as s2
    )
    , StopDates As
    (
    Select C.[Date]
    From Calendar As C
        Left Join @T As T
            On C.[Date] Between T.d1 And T.d2
    Where C.[Date] >= ( Select Min(T2.d1) From @T As T2 )
        And C.[Date] <= ( Select Max(T2.d2) From @T As T2 )
        And T.d1 Is Null
    )
    , StopDatesInUse As
    (
    Select D1.[Date]
    From StopDates As D1
        Left Join StopDates As D2
            On D1.[Date] = DateAdd(d,1,D2.Date)
    Where D2.[Date] Is Null
    )
    , DataWithEariestStopDate As 
    (
    Select *
    , (Select Min(SD2.[Date])
        From StopDatesInUse As SD2
        Where T.d2 < SD2.[Date] ) As StopDate
    From @T As T
    )
Select Min(d1), Max(d2)
From DataWithEariestStopDate
Group By StopDate
Order By Min(d1)

EDIT The problem with using dates in 2009 has nothing to do with the final query. The problem is that the Calendar table is not big enough. I started the Calendar table at 2009-12-31. I have revised it start at 1900-01-01.

Thomas 2010-04-07 17:38:22

Your code is merging intervals that are not supposed to be merged. Using this initial intervals /**/ SELECT '2009-01-01', '2009-01-01' UNION SELECT '2009-01-03', '2009-01-03' /**/ the code returns a single period : 2009-01-01 to 2009-01-03. In this case 2009-01-02 should not be included in the resulting interval.

leoinfo 2010-04-07 19:07:35

First, you should add the schema and specifically whether D1 = D2. None of your example data suggests that. Second, if you **add** {2010-01-01,2010-01-01}, to your existing example data, the first range should still be 2010-01-01 to 2010-06-13 because the first entry in your example covers 2010-01-01 to 2010-03-31. Third, if you instead **replace** the first entry in your example with {2010-01-01, 2010-01-01},{2010-03-01, 2010-03-01}, the results of my query are still correct. Making that change, the first two entries come out as {2010-01-01, 2010-01-01}, {2010-03-01, 2010-06-13}.

Thomas 2010-04-07 20:40:08

One more scenario, if you replace all entries with only {2010-01-01,2010-01-01},{2010-03-01,2010-03-01}, you will get those same two entries.

Thomas 2010-04-07 20:42:28

I see a typo in my edit but the results are still correct. If you replace the first entry with {2010-01-01,2010-01-01} and {2010-01-03,2010-01-03}, it does come out correctly in that it creates two ranges which exclude 2010-01-02.

Thomas 2010-04-07 20:44:12

@leoinfo - Added this comment to ensure that you are notified.

Thomas 2010-04-07 20:51:19

@leoinfo - My apologies. I notice now that your breaking example is in 2009. The problem is that in my original example, my Calendar table starts 2009-12-31. All that needs happen is to expand the Calendar table. I have adjusted.

Thomas 2010-04-08 03:24:43

@Thomas - Your single-query code returns the right results. I run some tests though and the only issue now is that the code is very slow for more (like 500+) records, regardless of using temp table or variable table. On my test, merging 500 overlapping intervals it took about 42s/32s (Temp/Variable). Oddly is that merging 500 NON-overlapping intervals took even longer, 65s/161s.

leoinfo 2010-04-19 16:38:48

@Thomas - With my solution, merging 500 overlapping intervals took about 37s/1s (Temp/Variable). Merging 500 NON-overlapping intervals took 150ms/56ms.

leoinfo 2010-04-19 16:40:24

@Thomas - I wander if your code could be further optimized, because it really looks better than mine. Unfortunately, in this form, it seems to be too resource-consuming.

leoinfo 2010-04-19 16:42:42

@leoinfo - Certainly one way to help is to make the Calendar table static. I.e., create the Calendar table outside of the query with a clustered PK on the date column.

Thomas 2010-04-19 18:25:23

@leoinfo - Have you tried using a temp table and creating indexes on d1 and d2? I'm going to guess that the most expensive part of the query is the StopDatesInUse CTE section because of DateAdd which it may not be interpreting as SARGable. You might also consider enabling `DATE_CORRELATION_OPTIMIZATION` (see http://technet.microsoft.com/en-us/library/ms177416.aspx) which will help significantly with queries against the Calendar table.

Thomas 2010-04-19 18:55:36

Answer 2

A:

Try this

;WITH T1 AS
(
    SELECT d1, d2, ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
    FROM @T
), NUMS AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
    FROM T1 A
    CROSS JOIN T1 B
    CROSS JOIN T1 C
), ONERANGE AS 
(
    SELECT DISTINCT DATEADD(DAY, ROW_NUMBER() OVER(PARTITION BY T1.R ORDER BY (SELECT 0)) - 1, T1.D1) AS ELEMENT
    FROM T1
    CROSS JOIN NUMS
    WHERE NUMS.R <= DATEDIFF(DAY, d1, d2) + 1
), SEQUENCE AS
(
    SELECT ELEMENT, DATEDIFF(DAY, '19000101', ELEMENT) - ROW_NUMBER() OVER(ORDER BY ELEMENT) AS rownum
    FROM ONERANGE
)
SELECT MIN(ELEMENT) AS StartDate, MAX(ELEMENT) as EndDate
FROM SEQUENCE
GROUP BY rownum

The basic idea is to first unroll the existing data, so you get a separate row for each day. This is done in ONERANGE

Then, identify the relationship between how dates increment and the way the row numbers do. The difference remains constant within an existing range/island. As soon as you get to a new data island, the difference between them increases because the date increments by more than 1, while the row number increments by 1.

Chris Bednarski 2010-04-28 11:11:31

ansaurus

tags:

views:

answers:

Merge overlapping date intervals

related questions