tags:

views:

370

answers:

6

I have a situation that I'm sure is quite common and it's really bothering me that I can't figure out how to do it or what to search for to find a relevant example/solution. I'm relatively new to MySQL (have been using MSSQL and PostgreSQL earlier) and every approach I can think of is blocked by some feature lacking in MySQL.

I have a "log" table that simply lists many different events with their timestamp (stored as datetime type). There's lots of data and columns in the table not relevant to this problem, so lets say we have a simple table like this:

CREATE TABLE log (  
  id INT NOT NULL AUTO_INCREMENT,  
  name VARCHAR(16),  
  ts DATETIME NOT NULL,  
  eventtype VARCHAR(25),  
  PRIMARY KEY  (id)  
)

Let's say that some rows have an eventtype = 'start' and others have an eventtype = 'stop'. What I want to do is to somehow couple each "startrow" with each "stoprow" and find the time difference between the two (and then sum the durations per each name, but that's not where the problem lies). Each "start" event should have a corresponding "stop" event occuring at some stage later then the "start" event, but because of problems/bugs/crashed with the data collector it could be that some are missing. In that case I would like to disregard the event without a "partner". That means that given the data:

foo, 2010-06-10 19:45, start  
foo, 2010-06-10 19:47, start  
foo, 2010-06-10 20:13, stop

..I would like to just disregard the 19:45 start event and not just get two result rows both using the 20:13 stop event as the stop time.

I've tried to join the table with itself in different ways, but the key problems for me seems to be to find a way to correctly identify the corresponding "stop" event to the "start" event for the given "name". The problem is exactly the same as you would have if you had table with employees stamping in and out of work and wanted to find out how much they actually were at work. I'm sure there must be well known solutions to this, but I can't seem to find them...

A: 

How about this:

SELECT start_log.ts AS start_time, end_log.ts AS end_time
FROM log AS start_log
INNER JOIN log AS end_log ON (start_log.name = end_log.name AND end_log.ts > start_log.ts)
WHERE NOT EXISTS (SELECT 1 FROM log WHERE log.ts > start_log.ts AND log.ts < end_log.ts)
 AND start_log.eventtype = 'start'
 AND end_log.eventtype = 'stop'

This will find each pair of rows (aliased as start_log and end_log) with no events in between, where the first is always a start and the last is always a stop. Since we disallow intermediate events, a start that's not immediately followed by a stop will naturally be excluded.

VoteyDisciple
To clarify there can be (and will be) lots of events inbetween the two, but not for the "name" in question that is of either "start" or "stop" type. It should still be possible to do something like this by modifying the "disallow" select, but the problem is that this is extremely slow. I've rewritten your query to fit with the actual tables and tested, and even when I limit the query to just one "name" and the time period to just one day, it takes several minutes to return.
Nadar
The return is empty because I didn't rewrite the conditions and there is never a case where the 'start' and 'stop' follow eachother, but I just ran it to test it. I'm not that into how a query uses indexes, all involved fields are indexed but not in the same index. Could that be the reason for the extremely slow response?
Nadar
Yipes. It will definitely take *some* performance hit, but it should not be so great. Running an `EXPLAIN` on it would show what MySQL is doing, exactly, that makes it so slow.
VoteyDisciple
The explain output really doesn't tell me much, but as mentioned above the table has some 100000 rows in it so it's simply just be because of that.
Nadar
Perhaps, but that's an awfully small table to be experiencing problems with. I'd expect an index covering (name, ts, eventtype) to work wonders.
VoteyDisciple
I've added such an index and ran it with my "combined solution" but I can't see any performance gain. That probably means that the existing indexes were already doing their job..?
Nadar
A: 
Nadar
Doesn't anybody have a better approach than the above?
Nadar
+2  A: 

I believe this could be a simpler way to reach your goal:

SELECT
    start_log.name,
    MAX(start_log.ts) AS start_time,
    end_log.ts AS end_time,
    TIMEDIFF(MAX(start_log.ts), end_log.ts)
FROM
    log AS start_log
INNER JOIN
    log AS end_log ON (
            start_log.name = end_log.name
        AND
            end_log.ts > start_log.ts)
WHERE start_log.eventtype = 'start'
AND end_log.eventtype = 'stop'
GROUP BY start_log.name

It should run considerably faster as it eliminates one subquery.

Ivar Bonsaksen
It runs fast indeed, but it has the opposite problem of that of OMG Ponies. It combines a single stop event with two start events in the case where one stop event is missing, thus giving a too high total timediff.
Nadar
Sorry... My bad!You will have to use MAX(start_log.ts) in the TIMEDIFF too, in order to calculate the correct timediff. I've edited the SQL, and it should now produce the desired result.
Ivar Bonsaksen
This method is elegant and clever, but still slow. It took over 52 minutes to complete with my 120,000 row test table, compared to under 6 seconds using my temporary table method. Using EXPLAIN on my test data to get a rough estimate of the number of rows that each method has to process, shows this query as: 49,095 x 24,000 = 1,178,280,000. The temp table method: 120,000 + 50,168 = 170,168. These figures correlate well with my actual results. Assuming the method I used actually produces the correct results (both produce the same number of records), it may be worth a try.
Mike
I don't see how you were able to make this query take 52 minutes...I just added 1.000.000 test-rows with start, stop and some other events to a table like this one, added some indexes (important!), and the query took 15 seconds to produce 150.000 result rows...
Ivar Bonsaksen
It's great that you can get it to run that fast, and leads me to wonder what I am doing differently. I created the table as shown in the question, with indexes on the name, ts and eventtype fields. I must admit, I didn't expect it to take as long as it did, but I have run both your query and mine multiple times, and get the same results every time. In fact, the only reason I came up with my temp table solution was because my first attempt, which was similar to yours, was very slow. I might try it on a different computer, with more resources.
Mike
It runs at reasonable speed, but: You have to swap the arguments for timediff (gives negative results) and it still combines a single stop event with several start events when one or more start events are missing. The MAX in the timediff didn't seem to get rid of that, and that is the core of my problem. The corresponding start and stop evens typically have minutes between them while it is hours between a stop and the next start event. This means that one such "wrong combine" adds hours to the result and ruins the report completely.
Nadar
I don't see how it could possibly combine a stop event without a corresponding start event with anything as long as it only matches events with the same name... I was under the impression that each name was only supposed to be used by one start-stop group (plus some other events in between).
Ivar Bonsaksen
A: 

If you don't mind creating a temporary table*, then I think the following should work well. I have tested it with 120,000 records, and the whole process completes in under 6 seconds. With 1,048,576 records it completed in just under 66 seconds - and that's on an old Pentium III with 128MB RAM:

*In MySQL 5.0 (and perhaps other versions) the temporary table cannot be a true MySQL temporary table, as you cannot refer to a TEMPORARY table more than once in the same query. See here:

http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html

Instead, just drop/create a normal table, as follows:

DROP TABLE IF EXISTS `tmp_log`;
CREATE TABLE `tmp_log` (
    `id` INT NOT NULL,
    `row` INT NOT NULL,
    `name` VARCHAR(16),
    `ts` DATETIME NOT NULL,
    `eventtype` VARCHAR(25),
    INDEX `row` (`row` ASC),
    INDEX `eventtype` (`eventtype` ASC)
);

This table is used to store a sorted and numbered list of rows from the following SELECT query:

INSERT INTO `tmp_log` (
    `id`,
    `row`,
    `name`,
    `ts`,
    `eventtype`
)
SELECT
    `id`,
    @row:=@row+1,
    `name`,
    `ts`,
    `eventtype`
FROM log,
(SELECT @row:=0) row_count
ORDER BY `name`, `id`;

The above SELECT query sorts the rows by name and then id (you could use the timestamp instead of the id, just so long as the start events appear before the stop events). Each row is also numbered. By doing this, matching pairs of events are always next to each other, and the row number of the start event is always one less than the row id of the stop event.

Now select the matching pairs from the list:

SELECT
    start_log.row AS start_row,
    stop_log.row AS stop_row,
    start_log.name AS name,
    start_log.eventtype AS start_event,
    start_log.ts AS start_time,
    stop_log.eventtype AS stop_event,
    stop_log.ts AS end_time,
    TIMEDIFF(stop_log.ts, start_log.ts) AS duration
FROM
    tmp_log AS start_log
INNER JOIN tmp_log AS stop_log
    ON start_log.row+1 = stop_log.row
    AND start_log.name = stop_log.name
    AND start_log.eventtype = 'start'
    AND stop_log.eventtype = 'stop'
ORDER BY start_log.id;

Once you're done, it's probably a good idea to drop the temporary table:

DROP TABLE IF EXISTS `tmp_log`;row

UPDATE

You could try the following idea, which eliminates temp tables and joins altogether by using variables to store values from the previous row. It sorts the rows by name then time stamp, which groups all values with the same name together, and puts each group in time order. I think that this should ensure that all corresponding start/stop events are next to each other.

SELECT id, name, start, stop, TIMEDIFF(stop, start) AS duration FROM (
    SELECT
        id, ts, eventtype,
        (@name <> name) AS new_name,
        @start AS start,
        @start := IF(eventtype = 'start', ts, NULL) AS prev_start,
        @stop  := IF(eventtype = 'stop',  ts, NULL) AS stop,
        @name  := name AS name
    FROM table1 ORDER BY name, ts
) AS tmp, (SELECT @start:=NULL, @stop:=NULL, @name:=NULL) AS vars
WHERE new_name = 0 AND start IS NOT NULL AND stop IS NOT NULL;

I don't know how it will compare to Ivar Bonsaksen's method, but it runs fairly fast on my box.

Here's how I created the test data:

CREATE TABLE  `table1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(5),
    `ts` DATETIME,
    `eventtype` VARCHAR(5),
    PRIMARY KEY (`id`),
    INDEX `name` (`name`),
    INDEX `ts` (`ts`)
) ENGINE=InnoDB;

DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < 1000000 DO
        INSERT INTO table1 (name, ts, eventtype) VALUES (
            CHAR(FLOOR(65 + RAND() * 26)),
            DATE_ADD(NOW(),
            INTERVAL FLOOR(RAND() * 365) DAY),
            IF(RAND() >= 0.5, 'start', 'stop')
        );
        SET i = i + 1;
    END WHILE;
END;
//
DELIMITER ;

CALL autofill();
Mike
I haven't tested this solution I must admit, since this is to be used for a report, and I can't assume that the user has the right to create tables. Both true temporary tables and the WITH statement would be of great help in this issue, but this is some of the approaches I assume to be unavailable because of MySQL's lack of features.
Nadar
@Nadar: I've added an update which might help.
Mike
A: 

Can you change the data collector? If yes, add a group_id field (with an index) into the log table and write the id of the start event into it (same id for start and end in the group_id). Then you can do

SELECT S.id, S.name, TIMEDIFF(E.ts, S.ts) `diff`
FROM `log` S
    JOIN `log` E ON S.id = E.group_id AND E.eventtype = 'end'
WHERE S.eventtype = 'start'
andrem
A: 

Try this.

select start.name, start.ts start, end.ts end, timediff(end.ts, start.ts) duration from (
    select *, (
        select id from log L2 where L2.ts>L1.ts and L2.name=L1.name order by ts limit 1
    ) stop_id from log L1
) start join log end on end.id=start.stop_id
where start.eventtype='start' and end.eventtype='stop';
Mark Eirich