tags:

views:

73

answers:

2

I have a table that holds data about items that existed at a certain time - regular snapshots taken.

Simple example:

Timestamp   ID
   1        A
   1        B
   2        A
   2        B
   2        C
   3        A
   3        D
   4        D
   4        E

In this case, Item C gets created sometime between snapshot 1 and 2 and sometime between snapshot 2 and 3 B and C disappear and D gets created, etc.

The table is reasonably large (millions of records) and for each timestamp there are about 50 records.

What's the most efficient way of selecting the item IDs for items that disappear between two consecutive timestamps?

So for the above example ...
Between 1 and 2: NULL
Between 2 and 3: B, C
Between 3 and 4: A

If it doesn't make the query inefficient, can it be extended to automatically use the latest (i.e. MAX) timestamp and the previous one?

+1  A: 

Update:

See this entry in my blog for performance details:

SELECT  ts,
        (
        SELECT  GROUP_CONCAT(id)
        FROM    mytable mi
        WHERE   mi.ts =
                (
                SELECT  MAX(ts)
                FROM    mytable mp
                WHERE   mp.ts = mo.pts
                )
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    mytable mn
                WHERE   mn.ts = mo.ts
                        AND mn.id = mi.id
                )
        )
FROM    (
        SELECT  @r AS pts,
                @r := ts AS ts
        FROM    (
                SELECT  @r := NULL
                ) vars,
                (
                SELECT  DISTINCT ts
                FROM    mytable
                ) moo
        ) mo

To select only the last change:

SELECT  ts,
        (
        SELECT  GROUP_CONCAT(id)
        FROM    mytable mi
        WHERE   mi.ts =
                (
                SELECT  MAX(ts)
                FROM    mytable mp
                WHERE   mp.ts < mo.ts
                )
                AND NOT EXISTS
                (
                SELECT  NULL
                FROM    mytable mn
                WHERE   mn.ts = mo.ts
                        AND mn.id = mi.id
                )
        )
FROM    (
        SELECT  MAX(ts) AS ts
        FROM    mytable
        ) mo

For this to be efficient, you need to have a composite index on mytable (timestamp, id) (in this order).

Quassnoi
+1  A: 

Another way to view this is that you want to find records that exist in timestamp #1 that do not exist in timestamp #2. The easiest way?

SELECT Timestamp
FROM records AS t1
WHERE NOT EXISTS (SELECT 1 FROM records AS t2 WHERE t2.id = t1.id AND t2.Timestamp = t1.Timestamp + 1)

Of course, I'm exploiting here the fact that your example timestamps are integers, when in reality I imagine they are genuine timestamps. But it turns out the integers work so well for this particular purpose, they'd be really handy to have around. So, perhaps we should make a numbered list of all available timestamps. The easiest way to get that?

CREATE TEMPORARY TABLE timestamp_map AS (
    timestamp_id AS INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    timestamp_value AS DATETIME
);

INSERT INTO timestamp_map (timestamp_value) (SELECT DISTINCT timestamp FROM records ORDER BY timestamp);

(You could also maintain such a table permanently by use of triggers.)

It's a bit out there, but I've gotten similar techniques to work very efficiently in the past for data like what you describe, when lots of back-and-forth subqueries and NOT EXISTS proved too slow.

VoteyDisciple