views:

124

answers:

3

I have a large table (millions of rows) where I need to find groups of records based on the presence of a certain column value and where a specified 'timeout' has not occurred. I figure one approach would be to find across the entire table where these 'timeout' gaps have occurred.

Example table:

+----------------+------+
| time           | base |
+----------------+------+
| 1245184797.064 | a    |
| 1245184802.020 | a    |
| 1245184807.103 | b    |
| 1245184812.089 | b    |
| 1245184816.831 | b    |
| 1245184821.856 | a    |
| 1245184821.856 | a    |
| 1245184855.903 | a    |
| 1245184855.903 | b    |
| 1245184858.362 | b    |
| 1245184858.362 | b    |
| 1245184860.360 | a    |
| 1245184860.360 | a    |
| 1245184862.174 | a    |
| 1245184862.174 | b    |
| 1245185001.480 | b    |
| 1245185417.556 | a    |
| 1245185417.844 | a    |
| 1245185419.960 | b    |
| 1245185420.181 | b    |
+----------------+------+

Given this set, how would I quickly find the points in the table where base=a hasn't occurred for a given number of seconds (say 5).

To boil it down, my objective is to find spans of records where base=a HAS occurred consistently without timing out.

+2  A: 

I think this will help you:

SELECT * FROM (
    SELECT t1.[time],
           t1.time - (SELECT MAX(time) FROM my_table t2 WHERE t2.time < t1.time and t2.base = 'a') AS timeout
    FROM my_table t1
    WHERE t1.base = 'a') d
WHERE timeout > 5

And don't forget to create index for this query to be more effective:

CREATE INDEX idx_my_table_time_base ON my_table (time, base)
alygin
+1 Works, remove `WHERE t1.base = 'a'` and replace `t2.base = 'a'` with `t2.base = t1.base` to make it search for all base gaps
Andomar
Yea. Use t2.base = t1.base in any case.
alygin
Also interesting, the time is similar to the above, 6 seconds for a base with 3000 rows. Your suggested index isn't being picked by the optimizer either. Still way too slow.
naturalethic
Try to update statistics with full scan on this index. May be server doesn't think this index is selective enought. If it doesn't help try to use hint in the inner select: "from my_tabe t2 with(index=my_index)". Does it help?
alygin
A: 

One way to approach this is to check for "stretch heads", that is, occurrences of a base with more than 5 seconds since it's last occurrence. This example query joins the table on itself to filter out non-heads:

select    head.* 
from      @t head
left join @t nohead 
on        head.base = nohead.base 
and       head.time - 5 < nohead.time and nohead.time < head.time
where     nohead.base is null
order by  head.[time]

For each row, the left join searches for the same base within the last 5 seconds. The where nohead.base is null clause says such a row may not exist. The effect is a list of when a 5+ second span without a base happens.

It won't list the last gap: you could explicitly add "end time" rows for each base:

<end time>     a
<end time>     b
...

to make the query check end-gaps.

Andomar
Interesting solution, but it still takes way too much time -- 6 seconds on query against a base with 3000 rows.
naturalethic
@naturalethic: I'd say add an index on (base,time)... but 6 seconds for 3000 rows is way too slow even if you didn't define any indexes.
Andomar
A: 

One possibility, if you are using a database that supports windowing/analytic functions is something like this:

select * from (
    select time,
           base,
           time - lag(time) over(partition by base order by time) as interval
    from example) w
where w.interval > 5

This should be able to work from a single scan of a (base,time) index. It works on PostgreSQL 8.4 and I think should also work on SQL Server 2008 and Oracle 10.

araqnid