tags:

views:

118

answers:

4

In general, I need to associate (group) records which are created in similar time periods. If it helps, thinking of the example below as clickstream data where there is no sessionID and I need to build those sessions.

I have the following dataset:

UserId INT,
EventId INT,
DateCreated DATETIME,
BlockId INT

Assume the following data:

{123, 111, '2009-12-01 9:15am', NULL}
{123, 222, '2009-12-01 9:20am', NULL}
{123, 333, '2009-12-01 9:25am', NULL}
{123, 444, '2009-12-03 2:30pm', NULL}
{123, 555, '2009-12-03 2:32pm', NULL}

What I need to do is divide these events up, by user, into temporal buckets. There is a business rule that says anything > 30 minutes should be a new bucket. In the above example, events 111-333 represent a block, i.e. not more than 30 minutes separates them. Likewise, events 444-555 represent a second block.

My current solution uses a cursor and is extremely slow (therefore, unsustainable for the amount of data I need to process). I can post the code but it is pretty simple.

Any ideas?

+1  A: 

Based on comment thread,

A. Buckets are defined by the first record in the bucket, and the first record in each Bucket is defined as any row where the DateCreated is more than 30 minutes after the latest earlier DateCreated. (immediately previous record)

B. The rest of the rows in the bucket are all rows with DateCreated on or after the First Row whose DateCreated is less than 30 minutes after the immediately previous row, and there does not exist a non-qualifying, (or new bucket-defining), row since the specified Bucket-defining row.

In English:

Select The DateCreated of those records wheret he DateCreated is more than 30 minutes after the previous DateCreated and aggregate function of your choice on all the other records in table whose DateCreated is after that bucket-defining datecreated, less than 30 minutes after it's immedialte previous DateCreated, and there are no records between the bucket-defining DateCreated and this one which follow a greater than 30 minute gap.

In SQL:

  Select Z.BucketDefinitionDate , Count(*) RowsInBucket
  From (Select Distinct DateCreated BucketDefinitionDate 
        From Table Ti
        Where DateCreated > DateAdd(minute, 30, 
            (Select Max(DateCreated) From Table
             Where DateCreated < Ti.DateCreated))) Z
    Join Table B 
       On B.DateCreated > Z.BucketDefinitionDate 
           And Not Exists
             (Select * From Table
              Where DateCreated Between Z.BucketDefinitionDate 
                                    And B.DateCreated 
                 And DateCreated > DateAdd(minute, 30,
                     (Select Max(DateCreated) From Table
                      Where DateCreated < B.DateCreated))) 
  Group By Z.BucketDefinitionDate
Charles Bretana
I wouldn't be posting if that solution would work :-) The caveat is that events can (a) occur in contiguous blocks longer than 30 minutes and (b) blocks can cross 30-min boundaries
austincav
You have to have some set-based rule as to what defines a "bucket" or a "block" which will define the bucket... What is that rule
Charles Bretana
Your question states "is a business rule that says anything > 30 minutes should be a new bucket" What does this "> 30 minutes" start from ? The previous row? So that if there is a row every 29 minutes they will all fall in same bucket? or from the first row in the bucket ? or from the last row in the previous bucket ?
Charles Bretana
Yes - per the example, the rule is about the difference between one record and the next in sequential order.
austincav
So more accurately a bucket is a group of events where the beginning and end event are at least 30 minutes removed from the next contiguous (time ordered) event. Does that make sense?I am basically trying to "cluster" them together based on timing.
austincav
And I do not have the buckets yet - the algorithm I want is going to *create* the buckets (whether it is by ID or something else, I don't care... just need a way to group them).
austincav
Maybe you should post the cursor code so we can see what the rules are
Ray
Replacing a poorly phrased comment - what happens if you have an event from the same user every 5 minutes? Do you get 6 events into a bucket, and then start a new bucket? Or do they all go in the same bucket because there is no 30 minute gap between any events? Or something else?
Ray
Based on new comments, I have rewritten query... try it now...
Charles Bretana
A: 

What you can try is

DECLARE @TABLE TABLE(
     ID INT,
     EventID INT,
     DateCreated DATETIME
)

INSERT INTO @TABLE SELECT 123, 111, '2009-12-01 9:15am'
INSERT INTO @TABLE SELECT 123, 222, '2009-12-01 9:20am' 
INSERT INTO @TABLE SELECT 123, 333, '2009-12-01 9:25am'
INSERT INTO @TABLE SELECT 123, 444, '2009-12-03 2:30pm'
INSERT INTO @TABLE SELECT 123, 555, '2009-12-01 2:32pm'

SELECT  ID,
     DATEADD(dd, DATEDIFF(dd,0,DateCreated), 0) DayVal,
     DATEPART(hh, DateCreated) HourPart,
     FLOOR(DATEPART(mi, DateCreated) / 30.) MinBucket
FROM    @TABLE

Now you can group by DayVal, HourPart and MinBucket.

astander
See note for previous answer, I think your solution suffers from same problem.
austincav
You will find this bucketing EXTEMELY difficult once the occurences apreat 5 mins apart. How do you decide where to start a given bucket and where to end? That will then revert to a cursor query...
astander
I know it is difficult. And a cursor is the easy--though slow--answer. I thought the smart people of StackOverflow could help me come up w/ a better answer.
austincav
+1  A: 

Hopefully this will get you going in the right direction. If you're in an SP then using table variables for the StartTimes and EndTimes should make the query much easier to read and understand. This will give you start and end times for your batches, then just join back to your table and you should have it.

;WITH StartTimes AS
(
SELECT DISTINCT
    T1.DateCreated AS StartTime
FROM
    My_Table T1
LEFT OUTER JOIN My_Table T2 ON
    T2.UserID = T1.UserID AND
    T2.EventID = T1.EventID AND
    T2.DateCreated >= DATEADD(mi, -30, T1.DateCreated) AND
    T2.DateCreated < T1.DateCreated
WHERE
    T2.UserID IS NULL
)
SELECT
    StartTimes.StartTime,
    EndTimes.EndTime
FROM
(
SELECT DISTINCT
    T3.DateCreated AS EndTime
FROM
    My_Table T3
LEFT OUTER JOIN My_Table T4 ON
    T4.UserID = T3.UserID AND
    T4.EventID = T3.EventID AND
    T4.DateCreated <= DATEADD(mi, 30, T3.DateCreated) AND
    T4.DateCreated > T3.DateCreated
WHERE
    T4.UserID IS NULL
) AS ET
INNER JOIN StartTimes ST ON
    ST.StartTime <= ET.EndTimes
LEFT OUTER JOIN StartTimes ST2 ON
    ST2.StartTime <= ET.EndTimes AND
    ST2.StartTime > ST.StartTime
WHERE
    ST2.StartTime IS NULL
Tom H.
You may be on to something here! let me know and get back to you. Thanks man.
austincav
A: 

I think I have something for you. it is not a cool single query like Tom H posted, but it seems to work. It uses a table variable as a working table.

declare @table table(
    id int identity(1,1),
    userId int,
    eventId int,dateCreated datetime,
    bucket int
)

insert into @table select 123, 111, '2009-12-01 9:15am', 0
// etc... insert more rows - note that the 'bucket' field is set to 0

declare @next_bucket int
set @next_bucket = 1
update @table 
    set bucket = @next_bucket, @next_bucket = @next_bucket + 1
    from @table as [current]
    where datecreated > dateadd(mi, 30, (select datecreated from @table as previous where [current].id = previous.id + 1))

update @table 
    set bucket =
     coalesce(( select max(bucket)
       from @table as previous
       where previous.id < [current].id
        and bucket <> 0
     ), 1)

    from @table as [current]
    where bucket = 0

-- return the results
select * from @table
Ray