tags:

views:

175

answers:

5

We have a project where data is written to a logging table. Now, when investigating a problem, it would be useful to query for the problem row, and also get the surrounding rows, so that we can easily see what led up to the problem. The log table had a timestamp field, so we can order by that.

Essentially I want something like "grep -C" for SQL.

As an example, imagine we have a reference_id column, and a activity_code column. I find a problem with reference ID = 1234, so I want to find the preceding N activities.

Edit: Some example data

Code        Reference   Time
Allocate    ABC1        9:00
Allocate    ABC2        9:01
Problem     MYREF1      9:02
Allocate    ABC3        9:03
Allocate    ABC4        9:03
Problem2    MYREF1      9:04
Allocate    ABC5        9:09

I'm given "MYREF1" as something to look at, but I want to see what's been going on at the same kind of time. I want a query that will pick up the "MYREF1" reference rows, as well as some of the other rows (maybe 1 or 2 of the surrounding or preceding rows). In my example that would be ABC2 and ABC4 if I wanted the preceding rows (analogous to grep -B1)

A: 

If you just want the latest :n rows for 1234:

select timestamp, activity_code
from
( select timestamp, activity_code
  from   log
  where  reference_id=1234
  order by timestamp desc
)
where rownum <= :n;
Tony Andrews
That's not quite what I want - I want all rows (whatever the reference number) that are close to my query row in time.
Greg Reynolds
Is the "reference number" a unique identifier? Doesn't appear to be from your example.
Tony Andrews
It's not unique no, there can be multiple rows with the same reference number (imagine it as an order number or customer ID or something like that)
Greg Reynolds
+1  A: 

Here's a conceptual decomposition of a way to do it.

Annotate your ordered logs with the row number:

WITH ordered_logs AS (
    SELECT ROWNUM r, log_table.*
    FROM log_table 
    ORDER BY timestamp
)
SELECT * FROM ordered_logs;

Look up the row number for the central information we're looking for:

SELECT r r0 FROM ordered_logs
WHERE reference_id = 1234; -- or whatever uniquely identifies your problem

Browse a few rows around it:

SELECT * FROM ordered_logs, sought WHERE r BETWEEN r0 - 5 AND r0 + 5;

Putting it all back together:

WITH
  ordered_logs AS (
    SELECT ROWNUM r, log_table.*
    FROM log_table 
    ORDER BY timestamp
  ),
  sought AS (
    SELECT r r0
    FROM ordered_logs
    WHERE reference_id = 1234
  )
SELECT *
FROM ordered_logs, sought
WHERE r BETWEEN r0 - 5 AND r0 + 5;

A lot of optimization is possible, but this is the simplest way of describing it I could find.

JB
A: 
create table grep_like (
  id  number,
  dt  date,
  txt varchar2(10)
);


insert into grep_like values(10, sysdate -  9/24/60/60, 'foo');
insert into grep_like values(30, sysdate -  8/24/60/60, 'bar');
insert into grep_like values(39, sysdate -  2/24/60/60, 'baz');
insert into grep_like values(22, sysdate -  5/24/60/60, '***');
insert into grep_like values(87, sysdate -  7/24/60/60, '###');
insert into grep_like values(57, sysdate -  4/24/60/60, '!!!');
insert into grep_like values(32, sysdate +  1/24/60/60, '---');
insert into grep_like values(99, sysdate - 12/24/60/60, '...');
insert into grep_like values(18, sysdate -  1/24/60/60, 'noo');
insert into grep_like values(20, sysdate - 10/24/60/60, 'moo');
insert into grep_like values(81, sysdate -  0/24/60/60, 'huh');


select p.dt, p.txt
from (
  select r.dt, r.txt, r.r,
         max(case when r.id = 57 then r.r else 0 end) over () p
  from (
    select dt, txt, id,
           row_number() over (order by dt)  r
      from grep_like
  ) r
) p
where 
  p.r - p.p between -1 and 1
;
René Nyffenegger
Thanks - this is almost what I want, but it does a full table scan, which is not really what I want to be doing (it's a large table).
Greg Reynolds
A: 

Another idea would be to grab a few minutes worth' of entries either way like this:

WITH ts as (
    SELECT timestamp
    FROM log_table
    WHERE reference_id = 1234)
SELECT *
FROM log_table join ts
WHERE timestamp > ts.timestamp - 5 minutes
AND timestamp < ts.timestamp + 5 minutes

Of course, the '+/- 5 minutes' has to be implemented according to your db-system.

This may be a bit easier than getting a rownumber and using that to define your "window", but it may not satisfy your requirements.

IronGoofy
+1  A: 

Interesting question.

You could possibly use Analytic Functions to give you a range of times that you might be interested in and then use this range to select from your logging table.

(Haven't had a chance to run this SQL but it should give you the idea).

This gives a TIME for 2 rows before and 2 rows after:

select
    l.code,
    l.reference,
    l.time,
    min(l.time) over (
        order by l.time 
        rows between 2 preceding and current row) 
            preceding_time,
    max(l.time) over (
        order by l.time 
        rows beween current row and 2 following) 
            following_time
from
    log_table l;

You can then use these "time boxes" to select a range in the driving table.

with timebox as
    (
    select
        l.code,
        l.reference,
        l.time,
        min(l.time) over (
            order by l.time 
            rows between 2 preceding and current row) 
                preceding_time,
        max(l.time) over (
            order by l.time 
            rows beween current row and 2 following) 
                following_time
    from
        log_table l
    )
select
    *
from
    log_table a
where
    exists
        (
        select 1 from
            timebox t
        where
            t.reference = 'MYREF1'
        and a.time between t.preceding_time and t.following_time
        );

Is that close to what you're after?

Nick Pierpoint
This is almost it. Frustratingly, when I use the between clause I get a full table scan on my data table, whereas if I just use an = following_time (or similar) clause I get good performance. I have no idea why this is happening, but it's making this solution not work.
Greg Reynolds
You're going to have a full scan in there somewhere as your effectively joining your log_table to your log_table. There will be a full table scan to get the timebox but you should be able to get an index search on a.time. Are you seeing two full table scans?
Nick Pierpoint
If you can't afford any full scan, you'll need an indexed column on which you can rely on its "order by" (so you've already got your 'timebox'). If each row in the real table has a sequential unique ID then you could use that, otherwise you might have to fall back to using a +/- x minutes instead of +/- x rows.
Nick Pierpoint