My question is about how to write an SQL query to calculate the average time between successive events.

I have a small table:

event Name    |    Time

stage 1       |    10:01
stage 2       |    10:03
stage 3       |    10:06
stage 1       |    10:10
stage 2       |    10:15
stage 3       |    10:21
stage 1       |    10:22
stage 2       |    10:23
stage 3       |    10:29

I want to build a query that returns the average of the times between stage(i) and stage(i+1).

For example, the average time between stage 2 and stage 3 is 5:

(3+6+6)/3 =  5
A: 

Your table design is flawed. How can you tell which stage 1 goes with which stage 2? Without a way to do this, I do not think your query is possible.

HLGEM
It's a sequence, ordered by time.
Manu
I just upvoted you but then cancelled. You CAN tell which happens after which! Just look at the time column! :)
Vilx-
HLGEM has a point. We have to presume that this is a serialized process - that is, a Stage 1 can never start while a Stage 3 is running. But in real life most processes are multi-thread/multi-user, so we would need an additional identifier to isolate the streams.
APC
+1  A: 

The easiest way would be to order by time and use a cursor (T-SQL) to iterate over the data. Since cursors are evil, it is advisable to fetch the data ordered by time into your application code and iterate there. There are probably other ways to do this in SQL, but they will be very complicated and rely on non-standard language extensions.
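
A minimal sketch of the application-side approach, assuming the rows have been fetched as (event name, time) pairs; the `events` list and the `average_gaps` helper are made up for illustration and are not part of the original answer:

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question, as they might come back from the database.
events = [
    ("stage 1", "10:01"), ("stage 2", "10:03"), ("stage 3", "10:06"),
    ("stage 1", "10:10"), ("stage 2", "10:15"), ("stage 3", "10:21"),
    ("stage 1", "10:22"), ("stage 2", "10:23"), ("stage 3", "10:29"),
]

def average_gaps(rows):
    """Average minutes between consecutive events, keyed by (from, to)."""
    # Sort by time so each row is compared with the one that follows it.
    parsed = sorted((datetime.strptime(t, "%H:%M"), name) for name, t in rows)
    gaps = defaultdict(list)
    for (t_prev, n_prev), (t_next, n_next) in zip(parsed, parsed[1:]):
        gaps[(n_prev, n_next)].append((t_next - t_prev).total_seconds() / 60)
    return {pair: sum(v) / len(v) for pair, v in gaps.items()}

averages = average_gaps(events)
print(averages[("stage 2", "stage 3")])  # 5.0
```

This also reports the stage 3 to stage 1 gap; drop that key if the wrap-around is not wanted.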

Manu
A: 

Try this:

   Select Avg(e.Time - s.Time)
   From Table s
     Join Table e 
         On e.Time = 
             (Select Min(Time)
              From Table
              Where eventname = s.eventname 
                 And time > s.Time)
         And Not Exists 
             (Select * From Table
              Where eventname = s.eventname 
                 And time < s.Time)

For each record representing the start of a stage, this SQL joins it to the record that represents the end, takes the difference between the end time and the start time, and averages those differences. The Not Exists ensures that the intermediate resultset of start records joined to end records only includes the start records as s, and the first join condition ensures that only the one end record (the one with the same name and the next time value after the start time) is joined to it.

To see the intermediate resultset after the join, but before the average is taken, run the following:

   Select s.EventName,
       s.Time Startime, e.Time EndTime, 
       (e.Time - s.Time) Elapsed
   From Table s
     Join Table e 
         On e.Time = 
             (Select Min(Time)
              From Table
              Where eventname = s.eventname 
                 And time > s.Time)
         And Not Exists 
             (Select * From Table
              Where eventname = s.eventname 
                 And time < s.Time)
Charles Bretana
I don't get it: what's the use of the 'and not exists' condition? It seems to exclude all but the first event, and cause the code to emit the average of a single value (the first transition time)...
meriton
The Not Exists is to make sure that the sql only outputs one row for each record that is the start of each eventstage. It filters out the Ending records from being on the left side of the left join - the table aliased as "s" - because for these records there are no other records with the same EventName and with an earlier time. For the Ending records, there is one other record (the start record) so the Not exists filters it out.
Charles Bretana
I'm not sure this works... How do you get stage + 1? This seems to get the interval between two occurrences of the same stage, i.e. stage 1(B) - stage 1(A). I think the question was how to get the difference between stages (i.e. stage 2 - stage 1).
James
A: 
WITH    q AS
        (
        SELECT  'stage 1' AS eventname, CAST('2009-01-01 10:01:00' AS DATETIME) AS eventtime
        UNION ALL
        SELECT  'stage 2' AS eventname, CAST('2009-01-01 10:03:00' AS DATETIME) AS eventtime
        UNION ALL
        SELECT  'stage 3' AS eventname, CAST('2009-01-01 10:06:00' AS DATETIME) AS eventtime
        UNION ALL
        SELECT  'stage 1' AS eventname, CAST('2009-01-01 10:10:00' AS DATETIME) AS eventtime
        UNION ALL
        SELECT  'stage 2' AS eventname, CAST('2009-01-01 10:15:00' AS DATETIME) AS eventtime
        UNION ALL
        SELECT  'stage 3' AS eventname, CAST('2009-01-01 10:21:00' AS DATETIME) AS eventtime
        UNION ALL
        SELECT  'stage 1' AS eventname, CAST('2009-01-01 10:22:00' AS DATETIME) AS eventtime
        UNION ALL
        SELECT  'stage 2' AS eventname, CAST('2009-01-01 10:23:00' AS DATETIME) AS eventtime
        UNION ALL
        SELECT  'stage 3' AS eventname, CAST('2009-01-01 10:29:00' AS DATETIME) AS eventtime
        )
SELECT  (
        -- CAST to FLOAT: AVG over an INT expression truncates in SQL Server
        SELECT  AVG(CAST(DATEDIFF(minute, '2009-01-01', eventtime) AS FLOAT))
        FROM    q
        WHERE   eventname = 'stage 3'
        ) - 
        (
        SELECT  AVG(CAST(DATEDIFF(minute, '2009-01-01', eventtime) AS FLOAT))
        FROM    q
        WHERE   eventname = 'stage 2'
        )

This relies on the fact that you always have complete groups of the stages and they always go in the same order (that is, stage 1 then stage 2 then stage 3)
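
To see why subtracting the two averages is enough, here is a small check in Python (not part of the original answer). It uses the sample times as minutes past 10:00 and also shows, under the assumption that the last stage 3 row is missing, how the shortcut breaks when a group is incomplete:

```python
# Minutes past 10:00 for each stage, taken from the question's sample table.
stage2 = [3, 15, 23]
stage3 = [6, 21, 29]

def mean(xs):
    return sum(xs) / len(xs)

# Pair each stage 2 with the stage 3 that follows it, then average the gaps.
mean_of_diffs = mean([end - start for start, end in zip(stage2, stage3)])

# The shortcut: subtract the two per-stage averages.
diff_of_means = mean(stage3) - mean(stage2)

# Hypothetical incomplete data: the last stage 3 row has not arrived yet.
incomplete = mean(stage3[:-1]) - mean(stage2)

print(mean_of_diffs, diff_of_means, incomplete)
```

With complete groups the two values agree up to floating-point rounding; with the missing row the shortcut even goes negative, which is the caveat stated above.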

Quassnoi
Couldn't you simplify this using avg() rather than sum()? You wouldn't have to divide by the count, then.
meriton
`@meriton`: you're right.
Quassnoi
+1  A: 
Select Avg(differ) from (
 Select s1.r, s2.r, s2.time - s1.time as differ from (
 Select * From (Select rownum as r, inn.time from table inn order by time) s1
 Join (Select rownum as r, inn.time from table inn order by time) s2
 On mod(s2.r, 3) = 2 and s2.r = s1.r + 1
 Where mod(s1.r, 3) = 1)
);

The parameters can be changed as the number of stages changes. This is currently set up to find the average between stages 1 and 2 from a 3 stage process.

EDIT: fixed a couple of typos.

David Oneill
Note - this is for PL/SQL dialect.
Vilx-
I didn't see your solution while writing my own. But if it's an upvote you want - here you go! :)
Vilx-
Thanks. I feel really petty saying that, but I've been stuck just under 500 for a while, and there were several retags I wanted to do.
David Oneill
+6  A: 

Aaaaand with a sprinkle of black magic:

select a.eventName, b.eventName, avg(b.Time-a.Time) from
         (select *, row_number() over (order by time) rn from Table) a
    join (select *, row_number() over (order by time) rn from Table) b on (a.rn=b.rn-1)
group by
    a.eventName, b.eventName

This will give you rows like:

Stage 1   Stage 2     10
Stage 2   Stage 3     20

The first column is the starting event, the second column is the ending event. If there is Event 3 right after Event 1, that will be listed as well. Otherwise you should provide some criteria as to which stage follows which stage, so the times are calculated only between those.

Added: This should work OK on both Transact-SQL (MSSQL, Sybase) and on Oracle and PostgreSQL. However, I haven't tested it and there could still be syntax errors. This will NOT work on MySQL versions before 8.0, which lack window functions.

Vilx-
+1. Analytical functions.. The power within..
Guru
+1 beautiful solution.
meriton
Actually this query will also give you `stage 3 stage 1 150`. It's not clear from the requirements whether that is desired. I assumed it was not.
APC
Thanks for commenting on my solution, then stealing it to claim as your own, and not even up-voting mine...
David Oneill
@David: Vilx's code differs from yours (e.g. you don't use group by), is better presented and better explained.
Manu
@APC: The author has not specified any criterion for distinguishing which event comes after which event, or what restarts the "sequence". For all we know, that IS desired.
Vilx-
@David - see my second comment at your solution.
Vilx-
A: 

I can't comment, but I have to agree with HLGEM. While you can tell with the provided data set, the OP should be made aware that relying on only a single set of stages existing at one time may be too optimistic.


event Name    |    Time

stage 1       |    10:01
stage 2       |    10:03
stage 3       |    10:06
stage 1       |    10:10
stage 2       |    10:15
stage 3       |    10:21
stage 1       |    10:22
stage 2       |    10:23
stage 1       |    10:25     --- new stage 1
stage 2       |    10:28     --- new stage 2
stage 3       |    10:29
stage 3       |    10:34     --- new stage 3

We don't know the environment or what is creating the data. It is up to the OP to decide if the table is built correctly.
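
A rough illustration of the problem in Python, using the interleaved table above. The `run_id` labels (A to D) are a hypothetical extra column that I assigned by hand from the "new stage" annotations; nothing like it exists in the OP's table:

```python
from collections import defaultdict

# The interleaved table as (run_id, stage, minutes past 10:00).
# run_id is a hypothetical extra column; the original table has no such field.
rows = [
    ("A", 1, 1), ("A", 2, 3), ("A", 3, 6),
    ("B", 1, 10), ("B", 2, 15), ("B", 3, 21),
    ("C", 1, 22), ("C", 2, 23),
    ("D", 1, 25), ("D", 2, 28),   # the "new" run starts before run C finishes
    ("C", 3, 29),
    ("D", 3, 34),
]

def avg_2_to_3_by_adjacency(rows):
    """Pair each row with the next row in time order, ignoring run_id."""
    ordered = sorted(rows, key=lambda r: r[2])
    gaps = [b[2] - a[2] for a, b in zip(ordered, ordered[1:])
            if a[1] == 2 and b[1] == 3]
    return sum(gaps) / len(gaps)

def avg_2_to_3_by_run(rows):
    """Pair stage 2 with stage 3 inside the same run."""
    times = defaultdict(dict)
    for run_id, stage, minute in rows:
        times[run_id][stage] = minute
    gaps = [t[3] - t[2] for t in times.values() if 2 in t and 3 in t]
    return sum(gaps) / len(gaps)

naive = avg_2_to_3_by_adjacency(rows)
per_run = avg_2_to_3_by_run(rows)
print(naive, per_run)
```

Pairing by adjacency matches run D's stage 2 with run C's stage 3 and never sees run C's or run D's real gap, so the two averages differ. Without some identifier like this, no query can tell the runs apart.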

Oracle would handle this with analytic functions, like Vilx's answer.

blacksol
A: 

You don't say which flavour of SQL you want the answer for. This probably means you want the code in SQL Server (as [sql] commonly = [sql-server] in SO tag usage).

But just in case you (or some future seeker) are using Oracle, this kind of query is quite straightforward with analytic functions, in this case LAG(). Check it out:

SQL> select stage_range
  2         , avg(time_diff)/60 as average_time_diff_in_min
  3  from
  4      (
  5          select event_name
  6                 , case when event_name = 'stage 2' then  'stage 1 to 2'
  7                      when event_name = 'stage 3' then  'stage 2 to 3'
  8                      else  '!!!' end as stage_range
  9                 , stage_secs - lag(stage_secs)
 10                              over (order by ts, event_name) as time_diff
 11                 from
 12                     ( select event_name
 13                              , ts
 14                              , to_number(to_char(ts, 'sssss')) as stage_secs
 15                       from timings )
 16      )
 17         where event_name in ('stage 2','stage 3')
 18  group by stage_range
 19  /

STAGE_RANGE  AVERAGE_TIME_DIFF_IN_MIN
------------ ------------------------
stage 1 to 2               2.66666667
stage 2 to 3                        5

SQL>

The change of format in the inner query is necessary because I have stored the TIME column as a DATE datatype, so I convert it into seconds to make the mathematics clearer. An alternate solution would be to work with Day to Second Interval datatype instead. But this solution is really all about LAG().
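
For anyone without an Oracle instance to hand, the same LAG() idea can be tried from Python with the bundled sqlite3 module. This is only a sketch under assumptions: it needs SQLite 3.25 or later for window functions, it stores the time as text, it borrows an arbitrary date (2009-01-01) to build a full timestamp, and it groups by event name instead of the stage_range labels used above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timings (event_name TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO timings VALUES (?, ?)",
    [("stage 1", "10:01"), ("stage 2", "10:03"), ("stage 3", "10:06"),
     ("stage 1", "10:10"), ("stage 2", "10:15"), ("stage 3", "10:21"),
     ("stage 1", "10:22"), ("stage 2", "10:23"), ("stage 3", "10:29")],
)

query = """
SELECT event_name, AVG(time_diff) / 60.0 AS avg_minutes
FROM (
    SELECT event_name,
           -- epoch seconds of this row minus epoch seconds of the previous row
           secs - LAG(secs) OVER (ORDER BY ts, event_name) AS time_diff
    FROM (
        SELECT event_name, ts,
               CAST(strftime('%s', '2009-01-01 ' || ts) AS INTEGER) AS secs
        FROM timings
    )
)
WHERE event_name IN ('stage 2', 'stage 3')
GROUP BY event_name
ORDER BY event_name
"""

# Keys are the ending stage: 'stage 2' means stage 1 to 2, and so on.
result = dict(conn.execute(query).fetchall())
print(result)
```

The two averages should agree with the SQL*Plus output above.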

edit

In my take on this query I have explicitly not calculated the difference between a prior Stage 3 and a subsequent Stage 1. This is a matter of requirement.

APC