views:

78

answers:

2

So I have a set of data that includes a column of dates and an ID column. Here's a sample of the date data:

5/30/10 12:00 AM
5/30/10 12:01 AM
5/30/10 1:59 AM
5/30/10 1:59 AM
5/30/10 3:58 AM
5/30/10 3:58 AM
5/30/10 5:57 AM
5/30/10 6:57 AM
5/30/10 7:56 AM
5/30/10 7:56 AM
5/30/10 9:55 AM
5/30/10 11:54 AM

What I'd like to do is create buckets for these rows based on a parameter like "2 hours". These two hour windows would start at the earliest date in the dataset, but would jump to the next starting time as you scanned through the list. For example, the expected output of "buckets" for my list would be:

5/30/10 12:00 AM  1
5/30/10 12:01 AM  1
5/30/10 1:59 AM   1
5/30/10 1:59 AM   1
5/30/10 3:58 AM   2
5/30/10 3:58 AM   2
5/30/10 5:57 AM   2
5/30/10 6:57 AM   3
5/30/10 7:56 AM   3
5/30/10 7:56 AM   3
5/30/10 8:55 AM   3
5/30/10 11:54 AM  4

So you can see that when I get to 3:58 AM, it's in the 2nd group because it's more than 2 hours past 12:00 AM. However 5:57 AM is still in the 2nd group even though it's more than 4 hours past 12:00 AM because the 2nd group base time is 3:58 AM, not 2:00 AM.

I've tried to create the grouping column by using a partition function like this:

FLOOR(DATEDIFF(SECOND, t.BaseCreateDate, t.CreateDate) / t.DedupWindow)

Where BaseCreateDate is the earliest date in my set, CreateDate is the data I listed, and DedupWindow is the 2 hours. However, that gives me fixed 2 hour windows and I can't seem to find math that resets the base as needed through the data.

I have this working in a cursor, but for a couple of reasons I'd like to get it working set based.

A: 

Edit: Here's quite a ridiculous solution that appears to bring back the right results though.

http://cloudexchange.cloudapp.net/stackoverflow/q/2282

 WITH DATE AS
(
SELECT '20100530 00:00:00.000'  AS CreateDate UNION ALL
SELECT '20100530 00:01:00.000' UNION ALL
SELECT '20100530 01:59:00.000' UNION ALL
SELECT '20100530 01:59:00.000' UNION ALL
SELECT '20100530 03:58:00.000' UNION ALL
SELECT '20100530 03:58:00.000' UNION ALL
SELECT '20100530 05:57:00.000' UNION ALL
SELECT '20100530 06:57:00.000' UNION ALL
SELECT '20100530 07:56:00.000' UNION ALL
SELECT '20100530 07:56:00.000' UNION ALL
SELECT '20100530 08:55:00.000' UNION ALL
SELECT '20100530 11:54:00.000'
),PossibleBuckets As
(
SELECT     earlier.CreateDate, MIN(later.CreateDate) AS NextBucket
FROM         Date AS earlier INNER JOIN
                      Date AS later ON later.CreateDate > earlier.CreateDate 
and later.CreateDate > DATEADD(hour, 2, earlier.CreateDate)  
GROUP BY earlier.CreateDate
),
dates AS (
    SELECT   TOP(1)  CreateDate, NextBucket, 1 AS rank
    FROM         PossibleBuckets
    ORDER BY CreateDate
  UNION ALL
  SELECT   PB.CreateDate, PB.NextBucket, Rank + 1 AS Rank
    FROM   PossibleBuckets PB
    JOIN dates on dates.NextBucket = PB.CreateDate
)
SELECT d.CreateDate, COALESCE(x.Rank, (SELECT MAX(Rank)+1 FROM dates), 1) AS Rank
  FROM date d left join dates x on d.CreateDate >= x.CreateDate AND d.CreateDate < x.NextBucket
Martin Smith
Thanks, Martin, but the two hour buckets aren't fixed. They need to be reset based on the value in the tables. So my buckets aren't 12:00:00 AM, 2:00:00 AM, 4:00:00 AM, etc. They're 12:00 AM - 2:00 AM, 3:58 - 5:58, 6:57 - 8:57, and 11:54 - 1:54 PM
Jason
Ah I missed that!
Martin Smith
Thanks, Martin. I'm going to test performance today with each solution on our larger datasets.
Jason
@Jason Cool. Would be interested to hear the results. I wonder whether this is something for which a cursor is actually best.
Martin Smith
+1  A: 

Nasty but works http://cloudexchange.cloudapp.net/stackoverflow/q/2281

Aggregates are not allowed in the recursive part of a CTE.

DECLARE @t AS TABLE (dt datetime) ;
INSERT  INTO @t
VALUES  ('5/30/10 12:00 AM')
INSERT  INTO @t
VALUES  ('5/30/10 12:01 AM')
INSERT  INTO @t
VALUES  ('5/30/10 1:59 AM')
INSERT  INTO @t
VALUES  ('5/30/10 1:59 AM') 
INSERT  INTO @t
VALUES  ('5/30/10 3:58 AM') 
INSERT  INTO @t
VALUES  ('5/30/10 3:58 AM') 
INSERT  INTO @t
VALUES  ('5/30/10 5:57 AM') 
INSERT  INTO @t
VALUES  ('5/30/10 6:57 AM') 
INSERT  INTO @t
VALUES  ('5/30/10 7:56 AM') 
INSERT  INTO @t
VALUES  ('5/30/10 7:56 AM')
INSERT  INTO @t
VALUES  ('5/30/10 8:55 AM')
INSERT  INTO @t
VALUES  ('5/30/10 11:54 AM') ;
WITH    CTE
          AS (
              SELECT    dt
                       ,1 AS bucket
              FROM      @t
              WHERE     dt = (
                              SELECT    MIN(dt)
                              FROM      @t
                             )
              UNION ALL
              SELECT    t.dt -- Cannot use aggregates here
                       ,CTE.bucket + 1 AS bucket
              FROM      CTE
              INNER JOIN @t AS t
                        ON t.dt >= DATEADD(HOUR, 2, CTE.dt)
             ),
        X AS (
              SELECT    dt
                       ,bucket
                       ,ROW_NUMBER() OVER (PARTITION BY BUCKET ORDER BY dt) AS bucket_start
              FROM      CTE
             ),
        Y AS (
              SELECT    *
              FROM      X
              WHERE     bucket_start = 1
             )
    SELECT  t.*
           ,(
             SELECT MAX(bucket)
             FROM   Y
             WHERE  Y.dt <= t.dt
            ) AS bucket
    FROM    @t AS t
Cade Roux
Thanks, Cade. This looks great. I'm testing performance on large datasets today.
Jason