tags:

views:

371

answers:

4

Hello,

I have a table with 2 fields: unique ID, user ID (foreign key) and date-time. This is an access-log to a service. I work in SQL Server but I would appreciate agnostic answers.

I would like using SQL to find for a certain user the ID from where the longest gap begins.

So for example, say my values are as follows (simplification for one user):

ID |  User-ID |  Time
----------------------------------
1  |  1       |  11-MAR-09, 8:00am
2  |  1       |  11-MAR-09, 6:00pm
3  |  1       |  13-MAR-09, 7:00pm
4  |  1       |  14-MAR-09, 6:00pm

If I search for the longest gap for user 1 I will get ID 2 (it would also be nice to get the length of the gap right there and then, but much less critical).

What's the most efficient way to achieve this in SQL?

Note: ID is not necessarily sequential.

Thank you

+1  A: 

First, join the table to itself so each record for a given user is paired with any record for that same user.

Then, select only those pairs where the first is before the last, there is no record before the first one, and no record after the last one.

 SELECT t1.id, t1.[user-id], t1.time, (t2.time - t1.time) AS GapTime
 FROM
     t AS t1
     INNER JOIN t AS t2 ON t1.[user-id] = t2.[user-id]
 WHERE
     t1.time < t2.time
     AND NOT EXISTS (SELECT NULL FROM t AS t3 WHERE t3.[user-id] = t1.[user-id]
         AND t3.time > t2.time)
     AND NOT EXISTS (SELECT NULL FROM t AS t4 WHERE t4.[user-id] = t1.[user-id]
         AND t4.time < t1.time)

Caveats:

  1. Does not return users that have 0 or 1 records.
  2. Does not return users where all records have the same date/time.
  3. Will return multiple records for a user if the user has duplicate records on the starting or ending boundary of their largest gap.

If desired, you can fix #2 above by changing "t1.time < t2.time" to "t1.time <= t2.time", which will give you a gap of 0 if there is only one record for the user.

richardtallent
This _is_ essential database agnostic, so +1 :)
Dems
EXISTS (SELECT * FROM x) has been shown to be faster than SELECT NULL in SQL Server. Essentially SQL Server has been tuned for that purpose.
Dems
-1, You are not looking at gaps between sequential points in time, but getting: _select "user-id", min(time), max(time), diff(..) from t group by "user-id_ And the ID matching min(time) for that user-id.
Shannon Severance
Shannon, the OP did not say anything about gaps between *neighboring* records when ranked by ID. Perhaps I'm mis-reading the question.
richardtallent
Dems, I don't think you are right about EXISTS(SELECT *) being faster than EXISTS(SELECT NULL). Can you provide documentation for this? I've seen arguments about EXISTS() versus joins, but nothing about performance based on the selection in an EXISTS() subquery. It should be a trivial difference that is removed by the query optimizer.
richardtallent
@Richard: Look at the example, for the data, he said he'd get ID 2. That answer fits with gaps between neighbors, not gaps between all pairs.
Shannon Severance
@Shannon: it actually fit both, since ID 2 is the earliest record. But the OP clarified, which certainly makes my answer wrong.
richardtallent
+2  A: 

Join ranked Time on one-off rank to get the gap:

with cte_ranked as (
select *, row_number() over (partition by UserId order by Time) as rn
from table)
select l.*, datediff(minute, r.Time, l.Time) as gap_length
from cte_ranked l join cte_ranked r on l.UserId = r.UserId and l.rn = r.rn-1

You can then use many methods to identify the maximum gap, when it started etc.

Update

My original answer was written from a Mac w/o a database to test with. I had some more time to play with this problem and actually test and measure how it performs on a 1M records table. My test table is defined like this:

create table access (id int identity(1,1)
    , UserId int not null
    , Time datetime not null);
create clustered index cdx_access on access(UserID, Time);
go

For selecting the record for any information, my preferred answer so far is this:

with cte_gap as (
    select Id, UserId, a.Time, (a.Time - prev.Time) as gap
    from access a
    cross apply (
        select top(1) Time 
        from access b
        where a.UserId = b.UserId
            and a.Time > b.Time
        order by Time desc) as prev)
, cte_max_gap as (
    select UserId, max(gap) as max_gap
    from cte_gap
    group by UserId)
select g.* 
    from cte_gap g
    join cte_max_gap m on m.UserId = g.UserId and m.max_gap = g.gap
where g.UserId = 42;

From 1M record, ~47k distinct users, the result for this is returned in 1ms on my test puny instance (warm cache), 48 page reads.

If the UserId=42 filter is removed the max gap and time it occurred for every user (with duplicates for multiple max gaps) need 6379139 reads, quite heavy, and takes 14s on my test machine.

The time can be cut in half if only the UserId and max gap is needed (no info when the max gap occurred):

select UserId, max(a.Time-prev.Time) as gap
    from access a
    cross apply (
        select top(1) Time 
        from access b
        where a.UserId = b.UserId
            and a.Time > b.Time
        order by Time desc
    ) as prev
group by UserId

This only needs 3193448 reads, only half compared to previous, and completed in 6 seconds on 1M records. The difference occurs because the previous version needed to evaluate every gap once to find the max one, then evaluate them again to find the ones that equal with the max. Note that for this performance results the structure of the table I proposed with an index on (UserId, Time) is critical.

As for the use of CTEs and 'partitions' (better known as ranking functions): this is all ANSI SQL-99 and is supported by most vendors. The only SQL Server specific construct was the use of the datediff function, which is now removed. I have a feeling some readers understand 'agnostic' as 'least common denominator SQL understood also by my favorite vendor'. Also note that the use of common table expressions and cross apply operator are used solely to improve the readability of the query. Both can be replaced with derived table using a simple, mechanical, replacement. Here is the very same query where the CTEs where replaced with derived tables. I'll let you judge yourselves on its readability compared with the CTE based one:

select g.*
    from (    
        select Id, UserId, a.Time, (a.Time - (
            select top(1) Time 
            from access b
            where a.UserId = b.UserId
                and a.Time > b.Time
            order by Time desc
        )) as gap
        from access a) as g
    join (
        select UserId, max(gap) as max_gap
            from (
                select Id, UserId, a.Time, (a.Time - (
                   select top(1) Time 
                   from access b
                   where a.UserId = b.UserId
                     and a.Time > b.Time
                   order by Time desc
                   )) as gap
            from access a) as cte_gap
        group by UserId) as m on m.UserId = g.UserId and m.max_gap = g.gap
    where g.UserId = 42

Damn, I was hopping will end up more convoluted lol. This is quite readable because it had only two CTEs to start from. Still, on queries with 5-6 derived tables, the CTE form is way, way more readable.

For completeness, here is the same transformation applied to my simplified query (only max gaps, no gap end time and access id):

select UserId, max(gap)
    from (
        select UserId, a.Time-(
            select top(1) Time 
            from access b
            where a.UserId = b.UserId
                and a.Time > b.Time
            order by Time desc) as gap
    from access a) as gaps
group by UserId
Remus Rusanu
Common Table Expressions, partitions, etc, are not database agnostic...
Dems
But if you are implementing on SQL Server, CTE with windowed function may be quite a bit faster. It's good to give both agnostic and specific answers, I think, sometimes when you see the difference in performance, a desire to go with an agnostic approach can disappear.
Shannon Severance
Although this is not a complete answer. Should wrap the select that generates gap_lengh in another named CTE, then rank it by user, and finally select where rank = 1.
Shannon Severance
+1  A: 

Very similar to RichardTallent's answer...

SELECT
   t1.id,
   t1.[user-id],
   t1.time,
   DATEDIFF(s, t1.time, t2.time) AS GapTime
FROM
   t AS t1
INNER JOIN
   t AS t2
      ON  t2.[user-id] = t1.[user-id]
      AND t2.time = (
         SELECT
            MIN(time)
         FROM
            t
         WHERE
            [user-id] = t1.[user-id]
            AND time > t1.time
      )


AS you are only actually using the time value from t2, you can actually re-organise as follows to deal with users with just one entry...

SELECT
   t1.id,
   t1.[user-id],
   t1.time,
   DATEDIFF(
      s,
      t1.time,
      (
         SELECT
            MIN(time)
         FROM
            t
         WHERE
            [user-id] = t1.[user-id]
            AND time > t1.time
      )
   ) AS GapTime
FROM
   t1


Finally, there is the possiblity of multiple entries with the same time stamp. When that happens we need additional info to decide the order allowing us to determine which record is 'next'.

Where there are several entries with the same time stamp, all bar one will have a GapTime of 0:
- '12:00' (Gap of 1 until next entry)
- '12:01' (Gap of 0 until next entry)
- '12:01' (Gap of 0 until next entry)
- '12:01' (Gap of 0 until next entry)
- '12:01' (Gap of 1 until next entry)

- '12:02' (Gap of NULL until next entry)

Only the one which is 'last' will have a non-zero time stamp. Although the question states that the "id" may not be in order, it is the only info we have for to determine which reocrd is 'last' when the timestamps are the same.

SELECT
   t1.id,
   t1.[user-id],
   t1.time,
   DATEDIFF(
      s,
      t1.time,
      (
         SELECT
            MIN(time)
         FROM
            t
         WHERE
            [user-id] = t1.[user-id]
            AND
            (
               (time > t1.time)
               OR
               (time = t1.time AND id > t1.id)
            )
      )
   ) AS GapTime
FROM
   t1
Dems
replace DATEDIFF with whatever function exists in you databsae implementation, the rest should be fairly generic
Dems
Not bad... I went the join route between the start and end record rather than correlated subqueries because it is more flexible if the OP wishes to later select additional information from either side. Both should have similar performance.
richardtallent
+4  A: 

Database-agnostic, something of a variant of richardtallent's, but without the restrictions.

Starting with this setup:

create table test(id int, userid int, time datetime)
insert into test values (1, 1, '2009-03-11 08:00')
insert into test values (2, 1, '2009-03-11 18:00')
insert into test values (3, 1, '2009-03-13 19:00')
insert into test values (4, 1, '2009-03-14 18:00')

(I'm SQL Server 2008 here, but it shouldn't matter)

Running this query:

select 
  starttime.id as gapid, starttime.time as starttime, endtime.time as endtime, 
  /* Replace next line with your DB's way of calculating the gap */
  DATEDIFF(second, starttime.time, endtime.time) as gap
from 
  test as starttime
inner join test as endtime on 
  (starttime.userid = endtime.userid) 
  and (starttime.time < endtime.time) 
left join test as intermediatetime on 
  (starttime.userid = intermediatetime.userid) 
  and (starttime.time < intermediatetime.time) 
  and (intermediatetime.time < endtime.time) 
where 
  (intermediatetime.id is null)

Gives the following:

gapid  starttime                endtime                  gap
1      2009-03-11 08:00:00.000  2009-03-11 18:00:00.000  36000
2      2009-03-11 18:00:00.000  2009-03-13 19:00:00.000  176400
3      2009-03-13 19:00:00.000  2009-03-14 18:00:00.000  82800

You can then just ORDER BY the gap expression descending, and pick the top result.

Some explanation: like richardtallent's answer, you join the table onto itself to find a 'later' record -- this basically pairs all records with ANY of their later records (so pairs 1+2, 1+3, 1+4, 2+3, 2+4, 3+4). Then there's another self-join, this time a left join, to find rows in between the two previously selected so (1+2+null, 1+3+2, 1+4+2, 1+4+3, 2+3+null, 2+4+3, 3+4+null). The WHERE clause, though, filters these out (keeps only the rows with no intermediate row), hence keeping only 1+2+null, 2+3+null, and 3+4+null. Taa-daa!

If you could, potentially, have the same time in there twice (a 'gap' of 0) then you'll need a way to break ties, as Dems points out. If you can use ID as a tie-breaker, then change e.g.

and (starttime.time < intermediatetime.time)

to

and ((starttime.time < intermediatetime.time) 
  or ((starttime.time = intermediatetime.time) and (starttime.id < intermediatetime.id)))

assuming that 'id' is a valid way to break ties.

In fact, if you know that ID will be monotonically increasing (I know you said 'not sequential' - not clear if this means that they don't increase with each row, or just that the IDs of the two relevant entries may not be sequential because e.g. another user has entries in between), you can use ID instead of time in all the comparisons to make this even simpler.

Cowan
+1 for: using meaningful table aliases. (Thank you!) And I liked the outer join to find pare the pairs of dates down to just those that had nothing in between. Never seen that, but makes a lot of sense. And pointing out that datediff is SQL Server specific.It would have been nice to seen this taken all the way through to filtering the result to just showing the info for max(gap)
Shannon Severance
Nice. Upvoting, I like the use of the LEFT OUTER join better than my double-use of a correlated subquery.
richardtallent