views:

43

answers:

4

I have a table of data recording certain user events. The results looks something like:

ID    Username EventDate
1     UserA    2010-10-21 16:59:59.367
2     UserA    2010-10-21 17:00:00.114    
3     UserA    2010-10-21 17:00:00.003
4     UserA    2010-10-21 17:00:02.867
5     UserB    2010-10-21 18:43:26.538
6     UserB    2010-10-21 18:47:33.373

I want to run a query that removes all events that occur within 3000 milliseconds of a previous event. Note that milliseconds are relevant.

The resulting table would look like:

ID    Username EventDate
1     UserA    2010-10-21 16:59:59.367
4     UserA    2010-10-21 17:00:02.867
5     UserB    2010-10-21 18:43:26.538
6     UserB    2010-10-21 18:47:33.373

How can I do this?

A: 

By previous event, did you mean insertion? Why dont take a time line as base?

Sachin Chourasiya
All the records are first sorted by time. From this point, I want to look at the first record and delete any subsequent records that occur within 3000 milliseconds. IDs 2 and 3 all occur within 3000 milliseconds of ID 1 and therefore they should be removed.
Alexander Voglund
A: 

You could write a procedure to do this by iterating through the events in order of date, and comparing the value of the previous event to the current one using DATEDIFF. If this is an active system and your goal is to prevent duplicate event logs, it would make more sense to use a trigger to prevent the insertion of any new event in a similar fashion.

jamietre
I'm not trying to prevent duplicate entries. There are other times where I expect events to occur very closely to each other. I could write a procedure like you suggest but iterating through records would be a lengthy process and I am hoping that there is some sort of simple/quick way.
Alexander Voglund
Can you explain the situation a little more? If you need to do this on a regular basis then I think prevention is a better solution then maintenance. If not then the time to run a cleanup procedure once seems irrelevant.
jamietre
+3  A: 

You can use a while loop to remove one row at a time. This avoids the problem where multiple rows are all within 3 seconds of eachother, but not within 3 seconds of the first row.

For example:

declare @t table (ID int, Username varchar(50), EventDate datetime)
insert @t
          select 1,     'UserA',    '2010-10-21 16:59:59.367'
union all select 2,     'UserA',    '2010-10-21 17:00:00.114'    
union all select 3,     'UserA',    '2010-10-21 17:00:00.003'
union all select 4,     'UserA',    '2010-10-21 17:00:02.867'
union all select 5,     'UserB',    '2010-10-21 18:43:26.538'
union all select 6,     'UserB',    '2010-10-21 18:47:33.373'


while 1=1
    begin
    delete  @t
    where   ID =
            (
            select  top 1 t2.ID
            from    @t t2
            where   exists
                    (
                    select  *
                    from    @t t1
                    where   t1.Username = t2.Username
                            and t1.EventDate < t2.EventDate
                            and datediff(millisecond, t1.EventDate, 
                                         t2.EventDate) <= 3000
                    )
            )

    if @@ROWCOUNT = 0 
        break
    end

select * from @t

This prints:

ID  Username    EventDate
1   UserA       2010-10-21 16:59:59.367
4   UserA       2010-10-21 17:00:02.867
5   UserB       2010-10-21 18:43:26.537
6   UserB       2010-10-21 18:47:33.373
Andomar
This looks good. Is there any concern about performance? I'm not looking at a huge amount of data (maybe 10000 rows at most.)
Alexander Voglund
If performance is of the utmost concern, i think it would be faster to just run a single select, iterate through the rows, keep the "last" date value in a variable, and compare the "last" date value to the current row. If you end up deleting a row, then clear the "last" value so that you don't have the "multiple rows close together deleted" problem.
jamietre
+2  A: 

If we're removing these results from a result set, and each users events are treated separately, then the following works (stealing table defn from Andomar's answer):

declare @t table (ID int, Username varchar(50), EventDate datetime)
insert @t
          select 1,     'UserA',    '2010-10-21 16:59:59.367'
union all select 2,     'UserA',    '2010-10-21 17:00:00.114'    
union all select 3,     'UserA',    '2010-10-21 17:00:00.003'
union all select 4,     'UserA',    '2010-10-21 17:00:02.867'
union all select 5,     'UserB',    '2010-10-21 18:43:26.538'
union all select 6,     'UserB',    '2010-10-21 18:47:33.373'

;WITH PerUserIDs AS (
    SELECT ID,Username,EventDate,ROW_NUMBER() OVER (PARTITION BY Username ORDER BY EventDate) as R from @t
), Sequenced AS (
    SELECT ID,Username,EventDate,R from PerUserIDs where R = 1
    union all
    select pui.ID,pui.UserName,pui.EventDate,pui.R
    from
        Sequenced s
            inner join
        PerUserIDs pui
            on
                s.R < pui.R and
                s.Username = pui.Username and
                DATEDIFF(millisecond,s.EventDate,pui.EventDate) >= 3000
    where
        not exists(select * from PerUserIDs anti where anti.R < pui.R and s.R < anti.R and s.Username = anti.username and DATEDIFF(millisecond,s.EventDate,anti.EventDate)>= 3000)
)
select * from Sequenced order by Username,EventDate

If you do need to actually delete, then you can delete from your table where ID not in (select ID from Sequenced)

Damien_The_Unbeliever
This is interesting. I don't want to delete records (I need to update my question) so this is a good option to test. I'll give it a shot now.
Alexander Voglund