views:

249

answers:

5

I have two tables: 'movies' and 'users'. There's an n:m relationship between those, describing what movies a user has seen. This is described with a table 'seen' Now i want to find out for a given user, all the movies he has not seen. My current solution is like this:

SELECT *
FROM movies 
WHERE movies.id NOT IN (
     SELECT seen.movie_id 
     FROM seen 
     WHERE seen.user_id=123
)

This works fine but does not seem to scale very well. Is there a better approach to this?

+3  A: 

Not only does your query work, it's the right approach to the problem as stated. Perhaps you can find a different way to approach the problem? A simple LIMIT on your outer select should be very fast even for large tables, for instance.

dwc
+3  A: 

Seen is your join table, so yes, this looks like the correct solution. You are effectively "subtracting" the set of movie IDs in SEEN (for a user) from the totality in MOVIES, resulting in the unseen movies for that user.

This is called a "negative join", and sadly NOT IN or NOT EXISTS are the best options. (I would love to see a negative join syntax that was similar to INNER/OUTER/LEFT/RIGHT joins, but where the ON clause could be a subtraction statement).

@Bill's solution without a subquery should work, although as he noted it is a good idea to test your solution for performance both ways. I suspect that subquery or not, the entire SEEN.ID index (and of course the entire MOVIE.ID index) is going to be evaluated both ways: it will depend on how the optimizer handles it from there.

Godeke
A: 

If your DBMS supports bitmap indexes, you could try them.

abababa22
He tagged the question 'mysql'. MySQL does not support bitmap indexes.
Bill Karwin
Oops, I didn't look at the tag. :(
abababa22
A: 

This works fine but does not seem to scale very well. Is there a better approach to this?

Did you try EXPLAIN on this query?

VolkerK
+4  A: 

Here's a typical way to do this query without using the subquery method you showed. This may satisfy @Godeke's request to see a join-based solution.

SELECT * 
FROM movies m
 LEFT OUTER JOIN seen s
 ON (m.id = s.movie_id AND s.user_id = 123)
WHERE s.movie_id IS NULL;

However, in most brands of database this solution can perform worse than the subquery solution. It's best to use EXPLAIN to analyze both queries, to see which one will do better given your schema and data.

Here's another variation on the subquery solution:

SELECT * 
FROM movies m
WHERE NOT EXISTS (SELECT * FROM seen s 
                  WHERE s.movie_id = m.id 
                    AND s.user_id=123);

This is a correlated subquery, which must be evaluated for every row of the outer query. Usually this is expensive, and your original example query is better. On the other hand, in MySQL "NOT EXISTS" is often better than "column NOT IN (...)"

Again, you must test each solution and compare the results to be sure. It's a waste of time to choose any solution without measuring performance.

Bill Karwin