tags:

views:

194

answers:

4

Hi,

Let's say I have a MySQL database with 3 tables:

table 1: Persons, with 1 column ID (int)
table 2: Newsletters, with 1 column ID (int)
table 3: Subscriptions, with columns Person_ID (int), Newsletter_ID (int), Subscribed (bool), Updated (Datetime)

Subscriptions.Person_ID points to a Person, and Subscription.Newsletter_ID points to a Newsletter. Thus, each person may have 0 or more subscriptions to 0 or more magazines at once. The table Subscriptions will also store the entire history of each person's subscriptions to each newsletter. If a particular Person_ID-Newsletter_ID pair doesn't have a row in the Subscriptions table, then it's equivalent to that pair having a subscription status of 'false'.

Here is a sample dataset

Persons
ID
1
2
3

Newsletters
ID
1
2
3

Subscriptions
Person_ID  Newsletter_ID  Subscribed  Updated
2                1           true     2010-05-01
3                1           true     2010-05-01
3                2           true     2010-05-10
3                1           false    2010-05-15

Thus, as of 2010-05-16, Person 1 has no subscription, Person 2 has a subscription to Newsletter 1, and Person 3 has a subscription to Newsletter 2. Person 3 had a subscription to Newsletter 1 for a while, but not anymore.

I'm trying to do 2 kinds of query.

  1. A query that shows everyone's active subscriptions as of query time (we can assume that updated will never be in the future -- thus, this means returning the record with the latest 'updated' value for each Person_ID-Newsletter_ID pair, as long as Subscribed is true (if the latest record for a Person_ID-Newsletter_ID pair has a Subscribed status of false, then I don't want that record returned)).

  2. A query that returns all active subscriptions for a specific newsletter - same qualification as in 1. regarding records with 'false' in the Subscribed column.

I don't use SQL/databases often enough to tell if this design is good, or if the SQL queries needed would be slow on a database with, say, 1M records in the Subscriptions table.

I was using the Visual query builder tool in Visual Studio 2010 but I can't even get the query to return the latest updated record for each Person_ID-Newsletter_ID pair.

Is it possible to come up with SQL queries that don't involve using subqueries (presumably because they would become too slow with a larger data set)? If not, would it be a better design to have a separate Subscriptions_History table, and every time a subscription status for a Person_ID-Newsletter-ID pair is added to Subscriptions, any existing record for that pair is moved to Subscriptions_History (that way the Subscriptions table only ever contains the latest status update for any Person_ID-Newsletter_ID pair)?

I'm using .net on Windows, so would it be easier (or the same, or harder) to do this kind of queries using Linq? Entity Framework?

Edit: Here's what happens if I use this query:

SELECT     Person_ID, Newsletter_ID, Allocation, Updated, MAX(Updated) AS Expr1
FROM         subscriptions
GROUP BY Person_ID, Newsletter_ID

I get rows 2 and 4 from the Subscriptions table mishmashed together (in row 2 of the results set below):

Person_ID Newsletter_ID Subscribed Updated     Expr1 
2         1             true       2010-05-01  2010-05-01 
3         1             true       2010-05-01  2010-05-15 
3         2             true       2010-05-10  2010-05-10

Thanks!

+2  A: 

separate your Subscriptions into 2 tables:

  • First will store actual subscriptions list (the subscriptions, that are true for now): Person_Id | Newsletter_Id
  • Second one will store subscriptions log (its updates or status changes)
zerkms
I was trying to do this by having another table called Subscriptions_Log with the same structure as the Subscriptions table. I added a BEFORE INSERT trigger to the Subscriptions that copies any existing row with the same Person_ID-Newsletter_ID from Subscriptions to Subscriptions_Log (which works), and then deletes that row from Subscriptions (which doesn't work -- a trigger can't modify a table that's locked due to the INSERT). So, triggers can't do it, what's the right way of doing this?
Jimmy
why did you insert another record if there is one??? why not just use **UPDATE**?
zerkms
Good question. There may or may not be an existing subscription record for the given Person_ID-Newsletter_ID; It seemed a bit simpler to always add a record, and let triggers do the moving if there was an existing record. But I suppose it would be more efficient to use INSERT... ON UPDATE!
Jimmy
+2  A: 

I've recently run into a somewhat similar problem.

I'm not an SQL expert, so I can't really give much advice on what's the best design for this. But until the pros chip in, maybe this helps:

SELECT s.Person_ID, s.Newsletter_ID  
FROM (
 SELECT MAX(ID) AS mid
 FROM Subscriptions
 GROUP BY 
  Person_ID,Newsletter_ID
) q
JOIN Subscriptions s
ON q.mid = s.ID
WHERE s.Subscribed = 1

Notice I've added an ID colum to your subscriptions table (I'll explain why in a sec).

Now, let's break down how this works (or how I think it works, anyway; I'd be glad to be corrected if I'm wrong).

First, you retrieve all records for a given person / newsletter. This is what the subquery does (yes, I know you said you'd rather not have subqueries, but I'm not sure you can do it without one). I'm grouping by person_id and newsletter_id. This can return more than one row. Notice I'm selecting MAX(ID). If you use an autoincremental ID and it's safe to assume the row with the highest number in the ID column is the newest one for the group (i.e. if you don't insert ID's manually), this subquery will get you the ID of the last row for each person / newsletter.

So, you can join this with the subscriptions table: the join condition is that the ID of the subscriptions row has to match the MAX id you retrieved from the subquery. Here you are only considering the most recent record for each newsletter/person. Then, you factor out inactive subscriptions by using a WHERE condition.

If you want to restrict the result to a given newsletter (or a given person), add that condition to the WHERE clause.

Indices should help making this query run faster.

Hope this helps.

Added

If for some reason you can't guarantee that MAX(Subscriptions.ID) will correspond to the last inserted row, you can probably do something like this (which follows the same logic, I think, but is a bit more verbose and probably less efficient):

SELECT Person_ID, Newsletter_ID  
FROM (
 SELECT MAX(Updated) AS upd, Newsletter_ID AS nid, Person_ID AS pid 
 FROM Subscriptions
 GROUP BY 
  Person_ID,Newsletter_ID
) q
JOIN Subscriptions s
ON q.pid = s.Person_ID AND q.nid = s.Newsletter_ID and q.upd = s.Updated
WHERE Subscribed = 1

New edit

On second thoughts, the alternative I've added (the one with MAX(Updated)) is wrong, I think. You can't know for sure the selected Newsletter_ID and Person_ID in the subquery will be the Newsletter_ID and Person_ID corresponding to the MAX(Updated) row. Since these columns are used for the join condition, this query could give bogus results.

Juan Pablo Califano
The first query appears to work, Thanks! I'd still be interested in a query that doesn't rely on the Subscriptions table having an ID column, but I could live with the ID column if I need to.Regarding your note about indices to make it run faster: should I have indices on Subscriptions.Newsletter_ID and .Person_ID (in addition to each table's own ID column)? Any other indices?
Jimmy
A: 

I think your design is pretty good. There is no inherent slowness for subqueries - use them if is the best way to express your query.

Here's the query that gets you all of the latest (i.e. not overridden) directives:

SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated)
FROM Subscriptions GROUP BY Person_ID, Newsletter_ID

Then you can use this query as a subquery of another query to get what you want. For your query #1:

SELECT x.Person_ID, x.Newsletter_ID FROM
  (SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated) 
   FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
WHERE x.Subscribed;

For query #2:

SELECT x.Person_ID FROM
  (SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated)
   FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
WHERE x.Subscribed AND x.Newsletter_ID = ?

You'll definitely want an index on Newsletter_ID in the Subscriptions table, as this query will likely be very selective.

Edit: Whoops, the Subscriptions column in the subquery can come from an arbitrary row, not the one that generates the MAX(Updated). You have to rejoin with the original table:

SELECT x.Person_ID, x.Newsletter_ID, y.Subscribed FROM
  (SELECT Person_ID, Newsletter_ID, MAX(Updated) as MaxUpdated
   From Subscriptions GROUP by Person_ID, Newsletter_ID) x
  JOIN Subscriptions y WHERE x.Person_ID = y.Person_ID AND
                             x.Newsletter_ID = y.Newsletter_ID AND
                             x.MaxUpdated = y.Updated
Keith Randall
Thanks; I tried the 'query that gets all of the latest directives', and it doesn't seem to work; In the sample data set, it returns row 2 instead of row 4 (it's like it ignores the MAX() function -- in fact, if I remove MAX() it returns the same thing as with the MAX() function). Now, it *is* possible that I messed up something, but I don't think so... if you want I can give you access to the db to try out?
Jimmy
There is currently no index on the Subscriptions.Updated column -- does there *need* to be?
Jimmy
Maybe it's because the `WHERE` clause doesn't read `Subscribed = 1` (or true). `WHERE Subscribed` seems like an uncoditional condition, so to speak...
Juan Pablo Califano
I marked Subscriptions.Updated as indexed and it made no difference in terms of returned rows -- still (wrongly) returns row 2 instead of 4.
Jimmy
@Jimmy. But have you changed the `WHERE` clause to `WHERE Subscribed = 1`? Indices could make your query more efficient, but they should not affect its "correctness".
Juan Pablo Califano
@Juan: I'm using the first query, with no where, just to see if that works. I noticed something else: in the result, it returns the Updated value of row 4 mashed together with the Subscribed value of row 2! AArgh!
Jimmy
Here's what I'm seeing, if I use this Select statement together with the rest of the first querySELECT Person_ID, Newsletter_ID, Allocation, Updated, MAX(Updated) AS Expr1Person_ID Newsletter_ID Subscribed Updated Expr12 1 true 2010-05-01 2010-05-013 1 true 2010-05-01 2010-05-15 3 2 true 2010-05-10 2010-05-10Noticed how the 2nd row of the result set is a mishmash of rows 2 and 4 from the Subscriptions table
Jimmy
Well that's kind of hard to read! I'll add to the question
Jimmy
@Jimmy. I think that's because it's missing a self JOIN (not sure if that's the proper name) on the last row for each person / newsletter. I was helped to solved this in the problem I linked to in my answer. And I also think I solve this problem in the answer I posted here. Have you tried it?
Juan Pablo Califano
@Juan: I'll try yours next :) sorry, this answer had a simple first step that I wanted to try
Jimmy
@Jimmy. No problem. Just try the first one if you can add an ID column. The second, I think, sort of works for most cases, but is wrong.
Juan Pablo Califano
A: 

ordered analytical functions" is a standard method for this type of problem. 1M records, no problem... depending on the power of your machine, of course.

MAX( Updated) OVER( PARTITION BY list of fields over which you want "max" )

SELECT
  x.*
FROM
  (
    SELECT
       Person_ID
       , Newsletter_ID
       --, Subscribed
       , Updated
       , MAX(Updated) OVER( PARTITION BY Person_ID, Newsletter_ID, Subscribed) AS myUpdated
   FROM Subscriptions
  ) x 
WHERE Updated = myUpdated
Eddie
Sounds useful, but mysql doesn't support this :(
Jimmy
Mea culpa, I didn't check the answer for mySQL.
Eddie