views: 470
answers: 5

I have a simple table of events:

event_id | start_time | end_time

How do I query the maximum number of simultaneous events?

A: 

SELECT COUNT(event_id) AS total FROM table WHERE start_time='x' AND end_time='x'

you should ask more clearly next time.

dusoft
I think his question is quite clear. He wants to know the total events at the time when the most events were started and not yet ended.
Sparr
A: 

I would do this in a number of passes. It is a very slow solution, but there may not be a very fast way to do this. A solution based on Daniel Paull's answer would be far faster.

Sort your events by start time. Loop through the events and find gaps in which no events are ongoing, and group the events between these gaps. Then loop through every time (at whatever resolution your times are recorded) within each group and count the events that are ongoing at that time. Depending on the speed of your programming language versus the speed of DB queries, you may instead look at the overlapping events and skip forward to the earliest end_time among them.
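For illustration, a rough Python sketch of that multi-pass idea (the `max_simultaneous` function and the sample data are hypothetical; intervals are treated as inclusive, matching the BETWEEN used in the SQL answers):

```python
def max_simultaneous(events):
    # Sort by start_time, split into groups at gaps where nothing is
    # ongoing, then count overlaps at each start_time within a group
    # (a peak always begins at some event's start).
    events = sorted(events)
    groups, current, max_end = [], [], None
    for start, end in events:
        if current and start > max_end:
            groups.append(current)        # gap found: close this group
            current, max_end = [], None
        current.append((start, end))
        max_end = end if max_end is None else max(max_end, end)
    if current:
        groups.append(current)
    best = 0
    for g in groups:
        for t, _ in g:                    # check concurrency at each start
            best = max(best, sum(1 for s, e in g if s <= t <= e))
    return best

print(max_simultaneous([(0, 4), (1, 6), (2, 9), (4, 7), (10, 12)]))  # 4
```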

Sparr
+2  A: 

Depending on what you mean by simultaneous, as noted by other answers, this may be very similar to this question.

Unfortunately the solution I proposed (which was the accepted answer) would require you to redesign your table. However, it will let you trivially determine the maximum number of simultaneous events by examining the "SessionCount" (or similarly named) column.

Daniel Paull
I think he could build the table described in your answer programmatically; it would require building the two intertwined lists of ++ and -- events. An excellent link to a very relevant question.
Sparr
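As a sketch of what those intertwined ++/-- lists look like when built programmatically (hypothetical helper, not taken from the linked answer): each start contributes +1, each end -1, and the running total of the sorted deltas is the concurrency at each moment. Note that ties here sort the -1 first, so an event ending at time t is treated as finished before one starting at t.

```python
import itertools

def max_concurrent(events):
    # +1 at each start_time, -1 at each end_time; the running total of
    # the sorted deltas is the number of ongoing events at each moment.
    # Ties sort (t, -1) before (t, +1), so touching events don't count
    # as simultaneous.
    deltas = sorted([(s, +1) for s, e in events] +
                    [(e, -1) for s, e in events])
    totals = itertools.accumulate(d for _, d in deltas)
    return max(totals, default=0)

print(max_concurrent([(0, 4), (1, 6), (2, 9)]))  # 3
```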
A: 

Since your peak times will always end on an end_time, you can just check those times, kind of like Sparr suggested. So run a query that joins the table to itself and counts, for each event, the number of events that overlap its end_time. Then take the max of that.

This will give you your answer but slowly:

SELECT MAX(overlapAtEnd)
FROM
(
    SELECT 
     COUNT(1) AS overlapAtEnd   
    FROM 
     your_table AS t1, 
     your_table AS t2
    WHERE 
     t1.end_time BETWEEN t2.start_time AND t2.end_time
    GROUP BY t1.event_id
) AS foo
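As a sanity check, the query runs as-is on SQLite with some made-up rows (hypothetical sample data; times stored as 'HH:MM' strings, which compare correctly as text):

```python
import sqlite3

# Build a throwaway in-memory table and run the self-join query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (event_id INTEGER, start_time TEXT, end_time TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [(0, '10:00', '10:04'), (1, '10:01', '10:06'),
     (2, '10:02', '10:09'), (3, '10:04', '10:07')])
row = conn.execute("""
    SELECT MAX(overlapAtEnd)
    FROM (
        SELECT COUNT(1) AS overlapAtEnd
        FROM your_table AS t1, your_table AS t2
        WHERE t1.end_time BETWEEN t2.start_time AND t2.end_time
        GROUP BY t1.event_id
    ) AS foo
""").fetchone()
print(row[0])  # 4 -- all four sample events overlap at '10:04'
```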

Breaking it up into smaller groups (fewer rows to compare against), then taking the max of those smaller groups, speeds it up significantly:

SELECT MAX(maxOLP)
FROM
(
    SELECT
     MAX(overlapAtEnd) AS maxOLP,
     hr
    FROM
    (
     SELECT
      COUNT(1) AS overlapAtEnd,
      EXTRACT(HOUR FROM t1.end_time) AS hr
     FROM
      your_table AS t1,
      your_table AS t2
     WHERE
      t1.end_time BETWEEN t2.start_time AND t2.end_time
     GROUP BY t1.event_id, EXTRACT(HOUR FROM t1.end_time)
    ) AS foo
    GROUP BY hr
) AS foo2

There is a slight drawback to this faster approach... if your events generally span more than an hour, events that end in the next hour may still overlap but won't be counted. To fix this, simply group by a larger interval like a day or a week. Kind of hairy, but it works great and quickly gives you the result it sounds like you're looking for.

I lied about the drawback mentioned above. As it turns out, it is 100% accurate and doesn't exclude any events (regardless of what time interval you're grouping by).
A: 

My answer is very similar to Harry's first answer. I would attempt to make a slightly different performance optimisation though... Skip to the end to avoid a rambling explanation of why...

Harry's first answer (Core logic)

SELECT MAX(overlapAtEnd)
FROM
(
    SELECT 
        COUNT(1) AS overlapAtEnd                        
    FROM 
        your_table AS t1, 
        your_table AS t2
    WHERE 
        t1.end_time BETWEEN t2.start_time AND t2.end_time
    GROUP BY t1.event_id
) AS foo

The place that takes the most processing time is the join.

For every record in the table, you pick (t1.end_time). You then search the table again for rows where (t1.end_time >= t2.start_time), and for all matching records you check (t1.end_time <= t2.end_time).

Now, it is very easy for you to create an index on start_time. This makes the first check (t1.end_time >= t2.start_time) much faster:
- An index is a search tree for extremely fast searching
- This makes finding the first matching record very quick
- An index is essentially ordered
- This means it knows "everything after the first match also matches"
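The "ordered search tree" point can be mimicked in Python with `bisect` (hypothetical numbers): the engine binary-searches for the boundary instead of scanning, and everything before the boundary is known to match.

```python
import bisect

# An index over start_time is ordered, so finding how many rows satisfy
# start_time <= t is one binary search, not a scan of every row.
start_times = sorted([600, 601, 602, 604, 608, 612, 615, 618, 619, 622])
t = 612
matching_rows = bisect.bisect_right(start_times, t)
print(matching_rows)  # 6 -- rows 600..612 all match
```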

The last part though is key, because it means that... even after using an index for the first check (t1.end_time >= t2.start_time), we can still be left with a lot of records for the second check (t1.end_time <= t2.end_time)

[including the end_time in the index doesn't help here, and is discussed shortly]

0, '10:00', '10:04'   COUNT(*) WHERE '10:04' >= start_time  ==  4
1, '10:01', '10:06'   COUNT(*) WHERE '10:06' >= start_time  ==  4
2, '10:02', '10:09'   COUNT(*) WHERE '10:09' >= start_time  ==  5
3, '10:04', '10:07'   COUNT(*) WHERE '10:07' >= start_time  ==  4
4, '10:08', '10:12'   COUNT(*) WHERE '10:12' >= start_time  ==  6
5, '10:12', '10:17'   COUNT(*) WHERE '10:17' >= start_time  ==  7
6, '10:15', '10:18'   COUNT(*) WHERE '10:18' >= start_time  ==  8
7, '10:18', '10:22'   COUNT(*) WHERE '10:22' >= start_time  ==  10
8, '10:19', '10:24'   COUNT(*) WHERE '10:24' >= start_time  ==  10
9, '10:22', '10:25'   COUNT(*) WHERE '10:25' >= start_time  ==  10

=> leaves 68 rows to check the second condition; (t1.end_time <= t2.end_time)

Assuming a relatively smooth distribution of events, each record would (approximately and on average) match half the table. This means you're doing (n*n/2) checks, where n is the number of records in the table. Even at 100 records this gives 5000 checks. At 2000 records you're doing around 2 million checks!

The natural inclination is to add the end_time field to the index. This doesn't help, however. The index for (start_time, end_time) creates a search tree down to each unique start_time, then under each unique start_time there is a separate search tree for end_times.

In my example above, every start_time is unique. This means that you still need to do all 68 end_time checks. Only the start_time checks benefited from the index.

What we need to do is try to use the single "start_time" index to do more than we currently are. We need to give the query engine more information.

An example is to use "maximum event duration". For example, we may find that no event lasts longer than 8 minutes. This would give us the following query...

SELECT MAX(overlapAtEnd)
FROM
(
    SELECT 
        COUNT(1) AS overlapAtEnd                        
    FROM 
        your_table AS t1, 
        your_table AS t2
    WHERE 
            t1.end_time >= t2.start_time
        AND t1.end_time <= t2.end_time
        AND t1.end_time <= t2.start_time + [max_event_duration] 
    GROUP BY t1.event_id
) AS foo
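The same ten sample rows can be used to try this on SQLite (hypothetical demo: times stored as integer minutes so that start_time + duration is plain arithmetic, and the bracketed placeholder filled in from a quick MAX(end_time - start_time) probe):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (event_id INTEGER, start_time INTEGER, end_time INTEGER)")
# The ten events from the worked example, '10:00' stored as minute 600, etc.
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [(0, 600, 604), (1, 601, 606), (2, 602, 609), (3, 604, 607),
     (4, 608, 612), (5, 612, 617), (6, 615, 618), (7, 618, 622),
     (8, 619, 624), (9, 622, 625)])

# Probe for [max_event_duration] first, then plug it into the query.
max_dur = conn.execute(
    "SELECT MAX(end_time - start_time) FROM your_table").fetchone()[0]

row = conn.execute("""
    SELECT MAX(overlapAtEnd)
    FROM (
        SELECT COUNT(1) AS overlapAtEnd
        FROM your_table AS t1, your_table AS t2
        WHERE t1.end_time >= t2.start_time
          AND t1.end_time <= t2.end_time
          AND t1.end_time <= t2.start_time + ?
        GROUP BY t1.event_id
    ) AS foo
""", (max_dur,)).fetchone()
print(row[0])  # 4 -- same answer as the unbounded join, fewer checks
```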

Applying the example 8 minute duration to the data above, we reduce the 68 end_time checks down to 34.

0, '10:00', '10:04'   COUNT(*) WHERE '10:04' BETWEEN start_time AND start_time + 8 == 4
1, '10:01', '10:06'   COUNT(*) WHERE '10:06' BETWEEN start_time AND start_time + 8 == 4
2, '10:02', '10:09'   COUNT(*) WHERE '10:09' BETWEEN start_time AND start_time + 8 == 4
3, '10:04', '10:07'   COUNT(*) WHERE '10:07' BETWEEN start_time AND start_time + 8 == 4
4, '10:08', '10:12'   COUNT(*) WHERE '10:12' BETWEEN start_time AND start_time + 8 == 3
5, '10:12', '10:17'   COUNT(*) WHERE '10:17' BETWEEN start_time AND start_time + 8 == 2
6, '10:15', '10:18'   COUNT(*) WHERE '10:18' BETWEEN start_time AND start_time + 8 == 3
7, '10:18', '10:22'   COUNT(*) WHERE '10:22' BETWEEN start_time AND start_time + 8 == 4
8, '10:19', '10:24'   COUNT(*) WHERE '10:24' BETWEEN start_time AND start_time + 8 == 3
9, '10:22', '10:25'   COUNT(*) WHERE '10:25' BETWEEN start_time AND start_time + 8 == 3

=> leaves 34 rows to check the second condition; (t1.end_time <= t2.end_time)
=> that's half the original 68, and on bigger tables the benefit increases...

Even if we did not know that events never last longer than 8 minutes, we could have found that out just by checking the 10 records. MAX(end_time - start_time) over 10 records is still faster than checking (t1.end_time <= t2.end_time) over 34 combinations of records.

And as the size of the table increases, the benefit increases. In fact, where [max_event_duration] is significantly smaller than the whole time span covered by the table, you change the (n*n/2) square law into something much more like (n*x + n) which is linear.
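A back-of-the-envelope comparison of the two growth laws (the window size x = 10 is an arbitrary assumption, standing in for the average number of rows inside a max_event_duration window):

```python
def square_law_checks(n):
    # Each of n rows is compared against roughly n/2 others.
    return n * n // 2

def bounded_checks(n, x):
    # Only ~x rows fall inside the max_event_duration window, plus the
    # n index probes themselves: roughly linear in n.
    return n * x + n

print(square_law_checks(2000))   # 2000000
print(bounded_checks(2000, 10))  # 22000
```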

Dems.

SELECT
   MAX(overlapAtEnd)
FROM
(
    SELECT 
        COUNT(1) AS overlapAtEnd                        
    FROM 
        your_table AS t1, 
        your_table AS t2
    WHERE 
            t2.start_time <= t1.end_time
        AND t2.start_time >= t1.end_time - (SELECT MAX(end_time - start_time) FROM your_table)
        AND t2.end_time   >= t1.end_time
    GROUP BY t1.event_id
) AS foo
Dems