ansaurus

Question

SQL query - choosing 'last updated' record in a group, better db design?

Answer 1

+2 A:

Separate your Subscriptions into two tables:

The first stores the current subscriptions list (the subscriptions that are true right now): Person_Id | Newsletter_Id
The second stores the subscriptions log (updates and status changes).
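
A minimal sketch of that split, using Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere. The table name Subscriptions_Log, the Subscribed and Updated columns in the log, and the sample rows are assumptions made for illustration, not something taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Current state only: one row per subscription that is active right now.
    CREATE TABLE Subscriptions (
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        PRIMARY KEY (Person_ID, Newsletter_ID)
    );
    -- Append-only history of every status change (hypothetical layout).
    CREATE TABLE Subscriptions_Log (
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,
        Updated       TEXT    NOT NULL
    );
""")

# Person 3 subscribes to newsletter 1, then unsubscribes two weeks later.
conn.execute("INSERT INTO Subscriptions_Log VALUES (3, 1, 1, '2010-05-01')")
conn.execute("INSERT INTO Subscriptions VALUES (3, 1)")
conn.execute("INSERT INTO Subscriptions_Log VALUES (3, 1, 0, '2010-05-15')")
conn.execute("DELETE FROM Subscriptions WHERE Person_ID = 3 AND Newsletter_ID = 1")

# Person 2 subscribes to newsletter 1 and stays subscribed.
conn.execute("INSERT INTO Subscriptions_Log VALUES (2, 1, 1, '2010-05-01')")
conn.execute("INSERT INTO Subscriptions VALUES (2, 1)")

# "Who is subscribed now?" no longer needs grouping or a subquery.
current = conn.execute(
    "SELECT Person_ID, Newsletter_ID FROM Subscriptions "
    "ORDER BY Person_ID, Newsletter_ID"
).fetchall()
history_rows = conn.execute("SELECT COUNT(*) FROM Subscriptions_Log").fetchone()[0]
print(current, history_rows)
```

The trade-off is that every write has to touch both tables, so the two statements belong in one transaction.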

zerkms 2010-05-16 23:09:23

I was trying to do this by having another table called Subscriptions_Log with the same structure as the Subscriptions table. I added a BEFORE INSERT trigger to the Subscriptions that copies any existing row with the same Person_ID-Newsletter_ID from Subscriptions to Subscriptions_Log (which works), and then deletes that row from Subscriptions (which doesn't work -- a trigger can't modify a table that's locked due to the INSERT). So, triggers can't do it, what's the right way of doing this?

Jimmy 2010-05-17 19:07:29

why did you insert another record if there is one??? why not just use **UPDATE**?

zerkms 2010-05-17 20:35:02

Good question. There may or may not be an existing subscription record for the given Person_ID-Newsletter_ID; it seemed a bit simpler to always add a record and let triggers do the moving if there was an existing record. But I suppose it would be more efficient to use INSERT ... ON DUPLICATE KEY UPDATE!

Jimmy 2010-05-18 00:55:59
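
One way the write path discussed in these comments could look when done from application code instead of a trigger. This is only a sketch under assumptions: it runs on Python's built-in sqlite3, where the upsert is spelled ON CONFLICT ... DO UPDATE (SQLite 3.24 or newer); the MySQL counterpart is INSERT ... ON DUPLICATE KEY UPDATE. The function name set_subscription and the column layout are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Subscriptions (
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,
        Updated       TEXT    NOT NULL,
        PRIMARY KEY (Person_ID, Newsletter_ID)
    );
    CREATE TABLE Subscriptions_Log (
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,
        Updated       TEXT    NOT NULL
    );
""")

def set_subscription(person_id, newsletter_id, subscribed, updated):
    """Archive the existing row (if any), then insert or update the current one."""
    with conn:  # one transaction: both statements commit or roll back together
        # Copies zero rows when there is no existing subscription.
        conn.execute(
            "INSERT INTO Subscriptions_Log "
            "SELECT Person_ID, Newsletter_ID, Subscribed, Updated "
            "FROM Subscriptions WHERE Person_ID = ? AND Newsletter_ID = ?",
            (person_id, newsletter_id),
        )
        # Upsert: the primary key decides between INSERT and UPDATE.
        conn.execute(
            "INSERT INTO Subscriptions (Person_ID, Newsletter_ID, Subscribed, Updated) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT (Person_ID, Newsletter_ID) DO UPDATE SET "
            "Subscribed = excluded.Subscribed, Updated = excluded.Updated",
            (person_id, newsletter_id, subscribed, updated),
        )

set_subscription(3, 1, 1, "2010-05-01")
set_subscription(3, 1, 0, "2010-05-15")

current = conn.execute("SELECT * FROM Subscriptions").fetchall()
log = conn.execute("SELECT * FROM Subscriptions_Log").fetchall()
print(current, log)
```

Because nothing modifies Subscriptions from inside a trigger on Subscriptions, the locking problem described above does not come up.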

Answer 2

+2 A:

I've recently run into a somewhat similar problem.

I'm not an SQL expert, so I can't really give much advice on what's the best design for this. But until the pros chip in, maybe this helps:

SELECT s.Person_ID, s.Newsletter_ID  
FROM (
 SELECT MAX(ID) AS mid
 FROM Subscriptions
 GROUP BY 
  Person_ID,Newsletter_ID
) q
JOIN Subscriptions s
ON q.mid = s.ID
WHERE s.Subscribed = 1

Notice I've added an ID column to your subscriptions table (I'll explain why in a sec).

Now, let's break down how this works (or how I think it works, anyway; I'd be glad to be corrected if I'm wrong).

First, you retrieve all records for a given person / newsletter. This is what the subquery does (yes, I know you said you'd rather not have subqueries, but I'm not sure you can do it without one). I'm grouping by person_id and newsletter_id. This can return more than one row. Notice I'm selecting MAX(ID). If you use an auto-increment ID and it's safe to assume the row with the highest number in the ID column is the newest one for the group (i.e. if you don't insert IDs manually), this subquery will get you the ID of the last row for each person / newsletter.

So, you can join this with the subscriptions table: the join condition is that the ID of the subscriptions row has to match the MAX id you retrieved from the subquery. Here you are only considering the most recent record for each newsletter/person. Then, you factor out inactive subscriptions by using a WHERE condition.

If you want to restrict the result to a given newsletter (or a given person), add that condition to the WHERE clause.
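
To see the steps above run end to end, here is a small sketch on Python's built-in sqlite3 (a stand-in for MySQL). The four sample rows are an assumption, loosely modelled on the rows mentioned in the comments, and active_subscriptions is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Subscriptions (
        ID            INTEGER PRIMARY KEY AUTOINCREMENT,
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,
        Updated       TEXT    NOT NULL
    );
    -- Assumed sample rows; row 4 overrides row 2 for person 3 / newsletter 1.
    INSERT INTO Subscriptions (Person_ID, Newsletter_ID, Subscribed, Updated) VALUES
        (2, 1, 1, '2010-05-01'),
        (3, 1, 1, '2010-05-01'),
        (3, 2, 1, '2010-05-10'),
        (3, 1, 0, '2010-05-15');
""")

# Same shape as the query in this answer: newest ID per group, joined back.
LATEST_ACTIVE = """
    SELECT s.Person_ID, s.Newsletter_ID
    FROM (
        SELECT MAX(ID) AS mid
        FROM Subscriptions
        GROUP BY Person_ID, Newsletter_ID
    ) q
    JOIN Subscriptions s ON q.mid = s.ID
    WHERE s.Subscribed = 1
"""

def active_subscriptions(newsletter_id=None):
    """Return (Person_ID, Newsletter_ID) pairs whose latest row is subscribed."""
    sql, params = LATEST_ACTIVE, ()
    if newsletter_id is not None:
        # Optional restriction, added to the WHERE clause as described above.
        sql += " AND s.Newsletter_ID = ?"
        params = (newsletter_id,)
    sql += " ORDER BY s.Person_ID, s.Newsletter_ID"
    return conn.execute(sql, params).fetchall()

print(active_subscriptions())                 # [(2, 1), (3, 2)]
print(active_subscriptions(newsletter_id=1))  # [(2, 1)]
```

Person 3 drops out for newsletter 1 because the newest row for that pair (ID 4) has Subscribed = 0.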

Indices should help make this query run faster.

Hope this helps.

Added

If for some reason you can't guarantee that MAX(Subscriptions.ID) will correspond to the last inserted row, you can probably do something like this (which follows the same logic, I think, but is a bit more verbose and probably less efficient):

SELECT Person_ID, Newsletter_ID  
FROM (
 SELECT MAX(Updated) AS upd, Newsletter_ID AS nid, Person_ID AS pid 
 FROM Subscriptions
 GROUP BY 
  Person_ID,Newsletter_ID
) q
JOIN Subscriptions s
ON q.pid = s.Person_ID AND q.nid = s.Newsletter_ID and q.upd = s.Updated
WHERE Subscribed = 1

New edit

On second thoughts, the alternative I've added (the one with MAX(Updated)) is wrong, I think. You can't know for sure the selected Newsletter_ID and Person_ID in the subquery will be the Newsletter_ID and Person_ID corresponding to the MAX(Updated) row. Since these columns are used for the join condition, this query could give bogus results.

Juan Pablo Califano 2010-05-16 23:38:44

The first query appears to work, Thanks! I'd still be interested in a query that doesn't rely on the Subscriptions table having an ID column, but I could live with the ID column if I need to. Regarding your note about indices to make it run faster: should I have indices on Subscriptions.Newsletter_ID and .Person_ID (in addition to each table's own ID column)? Any other indices?

Jimmy 2010-05-17 02:57:22
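
On the index question, one plausible starting point (an assumption, not a measured recommendation) is a single composite index on the two grouping columns rather than one index per column. The sketch below uses Python's built-in sqlite3; the index name is hypothetical, and whether the planner actually uses the index has to be checked with EXPLAIN on the real MySQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Subscriptions (
        ID            INTEGER PRIMARY KEY AUTOINCREMENT,
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,
        Updated       TEXT    NOT NULL
    );
    -- Composite index matching the GROUP BY columns of the subquery.
    CREATE INDEX idx_subscriptions_person_newsletter
        ON Subscriptions (Person_ID, Newsletter_ID);
""")

# List the indexes that now exist on the table.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'Subscriptions'"
    )
]

# Ask the engine how it would run the grouping subquery.
# The last column of each plan row is the human-readable detail.
plan = [
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT MAX(ID) FROM Subscriptions GROUP BY Person_ID, Newsletter_ID"
    )
]
print(index_names)
print(plan)
```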

Answer 3

A:

I think your design is pretty good. There is no inherent slowness for subqueries - use them if they are the best way to express your query.

Here's the query that gets you all of the latest (i.e. not overridden) directives:

SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated)
FROM Subscriptions GROUP BY Person_ID, Newsletter_ID

Then you can use this query as a subquery of another query to get what you want. For your query #1:

SELECT x.Person_ID, x.Newsletter_ID FROM
  (SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated) 
   FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
WHERE x.Subscribed;

For query #2:

SELECT x.Person_ID FROM
  (SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated)
   FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
WHERE x.Subscribed AND x.Newsletter_ID = ?

You'll definitely want an index on Newsletter_ID in the Subscriptions table, as this query will likely be very selective.

Edit: Whoops, the Subscribed column in the subquery can come from an arbitrary row, not the one that generates the MAX(Updated). You have to rejoin with the original table:

SELECT x.Person_ID, x.Newsletter_ID, y.Subscribed FROM
  (SELECT Person_ID, Newsletter_ID, MAX(Updated) AS MaxUpdated
   FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
  JOIN Subscriptions y ON x.Person_ID = y.Person_ID AND
                          x.Newsletter_ID = y.Newsletter_ID AND
                          x.MaxUpdated = y.Updated
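
A quick way to check the rejoin on a small data set, sketched with Python's built-in sqlite3 in place of MySQL. The sample rows are assumed, and the table deliberately has no ID column, since this variant does not need one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Subscriptions (
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,
        Updated       TEXT    NOT NULL
    );
    -- Assumed sample rows; the last one overrides the second.
    INSERT INTO Subscriptions VALUES
        (2, 1, 1, '2010-05-01'),
        (3, 1, 1, '2010-05-01'),
        (3, 2, 1, '2010-05-10'),
        (3, 1, 0, '2010-05-15');
""")

# Step 1 (subquery x): newest Updated value per person / newsletter.
# Step 2 (join to y): fetch the full row that carries that value,
# so Subscribed is read from the newest row and not from an arbitrary one.
latest = conn.execute("""
    SELECT x.Person_ID, x.Newsletter_ID, y.Subscribed
    FROM (
        SELECT Person_ID, Newsletter_ID, MAX(Updated) AS MaxUpdated
        FROM Subscriptions
        GROUP BY Person_ID, Newsletter_ID
    ) x
    JOIN Subscriptions y
      ON  x.Person_ID     = y.Person_ID
      AND x.Newsletter_ID = y.Newsletter_ID
      AND x.MaxUpdated    = y.Updated
    ORDER BY x.Person_ID, x.Newsletter_ID
""").fetchall()
print(latest)  # [(2, 1, 1), (3, 1, 0), (3, 2, 1)]
```

One case to watch: if two rows for the same person / newsletter share the same Updated value, the join returns both of them.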

Keith Randall 2010-05-17 00:04:33

Thanks; I tried the 'query that gets all of the latest directives', and it doesn't seem to work; In the sample data set, it returns row 2 instead of row 4 (it's like it ignores the MAX() function -- in fact, if I remove MAX() it returns the same thing as with the MAX() function). Now, it *is* possible that I messed up something, but I don't think so... if you want I can give you access to the db to try out?

Jimmy 2010-05-17 00:18:27

There is currently no index on the Subscriptions.Updated column -- does there *need* to be?

Jimmy 2010-05-17 00:24:25

Maybe it's because the `WHERE` clause doesn't read `Subscribed = 1` (or true). `WHERE Subscribed` seems like an unconditional condition, so to speak...

Juan Pablo Califano 2010-05-17 00:27:24

I marked Subscriptions.Updated as indexed and it made no difference in terms of returned rows -- still (wrongly) returns row 2 instead of 4.

Jimmy 2010-05-17 00:28:17

@Jimmy. But have you changed the `WHERE` clause to `WHERE Subscribed = 1`? Indices could make your query more efficient, but they should not affect its "correctness".

Juan Pablo Califano 2010-05-17 00:33:58

@Juan: I'm using the first query, with no where, just to see if that works. I noticed something else: in the result, it returns the Updated value of row 4 mashed together with the Subscribed value of row 2! AArgh!

Jimmy 2010-05-17 00:41:58

Here's what I'm seeing if I use this SELECT statement together with the rest of the first query:

SELECT Person_ID, Newsletter_ID, Allocation, Updated, MAX(Updated) AS Expr1

Person_ID  Newsletter_ID  Subscribed  Updated     Expr1
2          1              true        2010-05-01  2010-05-01
3          1              true        2010-05-01  2010-05-15
3          2              true        2010-05-10  2010-05-10

Notice how the 2nd row of the result set is a mishmash of rows 2 and 4 from the Subscriptions table.

Jimmy 2010-05-17 00:47:40

Well that's kind of hard to read! I'll add to the question

Jimmy 2010-05-17 00:48:08

@Jimmy. I think that's because it's missing a self JOIN (not sure if that's the proper name) on the last row for each person / newsletter. I got help solving this in the problem I linked to in my answer. And I also think I solved this problem in the answer I posted here. Have you tried it?

Juan Pablo Califano 2010-05-17 00:48:55

@Juan: I'll try yours next :) sorry, this answer had a simple first step that I wanted to try

Jimmy 2010-05-17 00:56:19

@Jimmy. No problem. Just try the first one if you can add an ID column. The second, I think, sort of works for most cases, but is wrong.

Juan Pablo Califano 2010-05-17 00:59:34

Answer 4

A:

"Ordered analytical functions" are a standard method for this type of problem. 1M records, no problem... depending on the power of your machine, of course.

MAX( Updated) OVER( PARTITION BY list of fields over which you want "max" )

SELECT
  x.*
FROM
  (
    SELECT
       Person_ID
       , Newsletter_ID
       , Subscribed
       , Updated
       -- partition only by the grouping columns, so the newest row wins whatever its Subscribed value
       , MAX(Updated) OVER( PARTITION BY Person_ID, Newsletter_ID ) AS myUpdated
    FROM Subscriptions
  ) x 
WHERE Updated = myUpdated
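
For anyone who wants to try the window-function approach, here is a runnable sketch on Python's built-in sqlite3, which supports OVER (PARTITION BY ...) from SQLite 3.25 on; MySQL gained window functions in version 8.0. The sample rows are assumed, and the sketch partitions only by Person_ID and Newsletter_ID so that the newest row per pair is kept regardless of its Subscribed value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Subscriptions (
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,
        Updated       TEXT    NOT NULL
    );
    -- Assumed sample rows; the last one overrides the second.
    INSERT INTO Subscriptions VALUES
        (2, 1, 1, '2010-05-01'),
        (3, 1, 1, '2010-05-01'),
        (3, 2, 1, '2010-05-10'),
        (3, 1, 0, '2010-05-15');
""")

# The window function attaches the group's newest Updated value to every row;
# the outer WHERE then keeps only the row(s) that carry that value.
latest_rows = conn.execute("""
    SELECT Person_ID, Newsletter_ID, Subscribed, Updated
    FROM (
        SELECT Person_ID, Newsletter_ID, Subscribed, Updated,
               MAX(Updated) OVER (PARTITION BY Person_ID, Newsletter_ID) AS myUpdated
        FROM Subscriptions
    ) x
    WHERE Updated = myUpdated
    ORDER BY Person_ID, Newsletter_ID
""").fetchall()
print(latest_rows)
```

Unlike the GROUP BY version, no join back to the table is needed, because each row keeps its own columns alongside the group maximum.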

Eddie 2010-05-25 09:28:46

Sounds useful, but mysql doesn't support this :(

Jimmy 2010-06-05 12:47:23

Mea culpa, I didn't check the answer for mySQL.

Eddie 2010-06-08 12:17:44
