tags:

views:

204

answers:

2
+3  Q: 

Analytical Query

Am looking for a single query that can convert the following information in the table

name:time    :state
a   :10:00 AM:login
b   :10:05 AM:login
a   :10:06 AM:chatting
a   :10:08 AM:Idle
b   :10:11 AM:chatting
a   :10:10 AM:Logout
b   :10:12 AM:Logout

to something like this (given the time range 10 AM to 10:15 AM as the querying period)

name: State    :Duration
a   : chatting :2 Minutes 
a   : Idle     :2 Minutes
b   : chatting :1 Minute

Can this be done ONLY using SQL? Am using Informix version 11.5

+2  A: 

I'm pretty sure it can be done using only SQL, it's going to take me quite a bit of time to come up with a query for you, I'll edit it in when I'm done. The basic steps I think would first be to calculate the amount of time each one takes (done by taking each entry and joining it to the next entry and subtracting to find the time difference) then a simple group by clause with a sum will easily get it into the form you've described.

Edit: Here what I came up with

SELECT l.userid, l.state, SUM(t.minutes) AS duration
FROM Log l 
INNER JOIN (
    SELECT l1.id, (l2.time - l1.time) AS minutes
    FROM Log l1, Log l2
    WHERE l2.time == ( 
     -- find the next entry --
     SELECT TOP 1 ls.time
     FROM Log ls
     WHERE ls.Time > l1.Time && ls.userid = l1.userid
     ORDER BY ls.Time
    )
) t ON l.id == t.id
GROUP BY l.userid, l.state
ORDER BY l.userid

This is semi-pseudocode, I made up all the table names and things, and you won't be able to just subtract one time from another, you'll probably be using the DATEDIFF function. Besides that though, I think that's the gist of it. I think SQL is one of the most amazing languages, you can do almost anything with little code.

Ray Hidayat
You'll probably need an extra level of query to ensure there is no other log row for the same use between the two. You also can't use ORDER BY in sub-queries (at least, not in IDS). You can subtract one time from another; the result is an interval as required.
Jonathan Leffler
Hm I don't quite understand what you mean by ensuring "there is no other log row for the same use between the two". Could you please elaborate? Having no ORDER BY in sub-queries makes this very difficult! I don't know if there is an alternative way, but I'll think about it.
Ray Hidayat
+3  A: 

It can be done in a single SQL statement. Here's the proof.

Setup

CREATE TEMP TABLE eventtable
(
    name CHAR(3) NOT NULL,
    time DATETIME HOUR TO MINUTE NOT NULL,
    state CHAR(8) NOT NULL
);

INSERT INTO eventtable(name, time, state) VALUES('a', '10:00', 'login');
INSERT INTO eventtable(name, time, state) VALUES('b', '10:05', 'login');
INSERT INTO eventtable(name, time, state) VALUES('a', '10:06', 'chatting');
INSERT INTO eventtable(name, time, state) VALUES('a', '10:08', 'Idle');
INSERT INTO eventtable(name, time, state) VALUES('b', '10:11', 'chatting');
INSERT INTO eventtable(name, time, state) VALUES('a', '10:10', 'Logout');
INSERT INTO eventtable(name, time, state) VALUES('b', '10:12', 'Logout');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:01', 'login');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:02', 'chatting');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:03', 'Idle');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:04', 'Logout');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:05', 'Idle');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:06', 'Logout');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:07', 'Idle');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:08', 'Logout');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:09', 'login');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:11', 'chatting');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:12', 'Idle');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:13', 'chatting');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:14', 'Idle');
INSERT INTO eventtable(name, time, state) VALUES('c', '10:15', 'Logout');

Correct query

Note the conditions. The result table must exclude the periods between 'login' and the first other event; further, it must exclude the period between 'Logout' and the next event (presumably a 'login'). The self-join between the table on the name column and then the asymmetric join on the time column (using '<') ensures that events are in time order. The NOT EXISTS sub-select ensures that only adjacent events are considered. Using BETWEEN AND in the sub-query is a mistake because it includes its end points and it is crucial that r1.time and r2.time are excluded from the range; it took me a few minutes to spot that bug (the query ran but returned no rows, but why?)!

SELECT r1.name, r1.state, r2.TIME - r1.TIME AS duration
    FROM eventtable r1, eventtable r2
    WHERE r1.name = r2.name
      AND r1.time < r2.time
      AND r1.state != 'login'
      AND r1.state != 'Logout'
      AND r1.time BETWEEN DATETIME(10:00) HOUR TO MINUTE
                      AND DATETIME(10:15) HOUR TO MINUTE
      AND r2.time BETWEEN DATETIME(10:00) HOUR TO MINUTE
                      AND DATETIME(10:15) HOUR TO MINUTE
      AND NOT EXISTS (SELECT 1 FROM eventtable r3
                            WHERE r3.time > r1.time AND r3.time < r2.time
                      AND r3.name = r1.name
                      AND r3.name = r2.name);

This produces the answer:

name state      duration
a    chatting   0:02
a    Idle       0:02
b    chatting   0:01

c    chatting   0:01
c    Idle       0:01
c    Idle       0:01
c    Idle       0:01
c    chatting   0:01
c    Idle       0:01
c    chatting   0:01
c    Idle       0:01

The 'duration' value is an INTERVAL HOUR TO MINUTE; if you want a value in just minutes, you have to convert it with a cast (using 4 for the precision to allow for intervals up to 1440 minutes, or 1 day; the data is ambiguous for longer time frames):

(r2.time - r1.time)::INTERVAL MINUTE(4) TO MINUTE

Or:

CAST (r2.time - r1.time AS INTERVAL MINUTE(4) TO MINUTE)

IBM Informix Dynamic Server (IDS) has very verbose notations for time constants. In Standard SQL, you could use TIME as the type and TIME '10:00:00' as a value, but the seconds would be necessary in strict standard SQL. IDS does provide exact types that people want - such as DATETIME HOUR TO MINUTE. You'd also write INTERVAL MINUTE(4) in standard SQL; the 'TO MINUTE' should be optional.

Incorrect query

In my comment to Ray Hidayat's answer, I pointed out that the EXISTS sub-query is necessary to ensure that the events under consideration are contiguous - there are no intervening events. Here's the same query with start and end times added to the output, and the EXISTS clause missing (and 'duration' renamed to 'lapse'):

SELECT r1.name, r1.state, r2.TIME - r1.TIME AS lapse,
       r1.time AS start, r2.time AS end
    FROM eventtable r1, eventtable r2
    WHERE r1.name = r2.name
      AND r1.time < r2.time
      AND r1.state != 'login'
      AND r1.state != 'Logout'
      AND r1.time BETWEEN DATETIME(10:00) HOUR TO MINUTE
                      AND DATETIME(10:15) HOUR TO MINUTE
      AND r2.time BETWEEN DATETIME(10:00) HOUR TO MINUTE
                      AND DATETIME(10:15) HOUR TO MINUTE;

This produces the answer:

name state     lapse start end
a    chatting   0:04 10:06 10:10
a    chatting   0:02 10:06 10:08
a    Idle       0:02 10:08 10:10
b    chatting   0:01 10:11 10:12
c    chatting   0:13 10:02 10:15
c    chatting   0:12 10:02 10:14
c    chatting   0:11 10:02 10:13
c    chatting   0:10 10:02 10:12
c    chatting   0:09 10:02 10:11
c    chatting   0:07 10:02 10:09
c    chatting   0:06 10:02 10:08
c    chatting   0:05 10:02 10:07
c    chatting   0:04 10:02 10:06
c    chatting   0:03 10:02 10:05
c    chatting   0:02 10:02 10:04
c    chatting   0:01 10:02 10:03
c    Idle       0:12 10:03 10:15
c    Idle       0:11 10:03 10:14
c    Idle       0:10 10:03 10:13
c    Idle       0:09 10:03 10:12
c    Idle       0:08 10:03 10:11
c    Idle       0:06 10:03 10:09
c    Idle       0:05 10:03 10:08
c    Idle       0:04 10:03 10:07
c    Idle       0:03 10:03 10:06
c    Idle       0:02 10:03 10:05
c    Idle       0:01 10:03 10:04
c    Idle       0:10 10:05 10:15
c    Idle       0:09 10:05 10:14
c    Idle       0:08 10:05 10:13
c    Idle       0:07 10:05 10:12
c    Idle       0:06 10:05 10:11
c    Idle       0:04 10:05 10:09
c    Idle       0:03 10:05 10:08
c    Idle       0:02 10:05 10:07
c    Idle       0:01 10:05 10:06
c    Idle       0:08 10:07 10:15
c    Idle       0:07 10:07 10:14
c    Idle       0:06 10:07 10:13
c    Idle       0:05 10:07 10:12
c    Idle       0:04 10:07 10:11
c    Idle       0:02 10:07 10:09
c    Idle       0:01 10:07 10:08
c    chatting   0:04 10:11 10:15
c    chatting   0:03 10:11 10:14
c    chatting   0:02 10:11 10:13
c    chatting   0:01 10:11 10:12
c    Idle       0:03 10:12 10:15
c    Idle       0:02 10:12 10:14
c    Idle       0:01 10:12 10:13
c    chatting   0:02 10:13 10:15
c    chatting   0:01 10:13 10:14
c    Idle       0:01 10:14 10:15

This shows how each eligible start row for user 'c' is matched with each eligible end row, giving many spurious rows of data. The NOT EXISTS sub-query is a common theme when dealing with time-based queries. You can find information about these operations in Snodgrass's "Developing Time-Oriented Applications in SQL" (PDF available online at URL), and in Date, Darwen and Lorentzos "Temporal Data and the Relational Model".

Jonathan Leffler
Thanks for the link for Snodgrass's book. Very Informative.
calvinkrishy