views:

56

answers:

4

Say I have a game where a question is asked, people post responses which are scored, and the top 10 responses win. I have a SQL database that stores all of this information, so I might have tables such as Users, Questions, and Responses. The Responses table has foreign_keys user_id and question_id, and attribute total_score.

Obviously, for a particular Question I can retrieve the top 10 Responses with an order and limit:

SELECT * FROM Responses WHERE question_id=? ORDER BY total_score DESC LIMIT 10;

What I'm looking for is a way I can determine, for a particular User, a list of all their Responses that are winners (in the top 10 for their particular Question). It is simple programmatically to step through each Response and see if it is included in the top 10 for its Question, but I would like to optimize this so I am not doing N+1 queries where N is the number of Responses the User has submitted.

+2  A: 

Try an embedded select statement. I don't have access to a DB tool today so I can't confirm the syntax/output. Just make the appropriate changes to capture all the columns you need. You can also add questions to the main query and join off of responses.

select *
  from users
     , responses
 where users.user_id=responses.user_id
   and responses.response_id in (SELECT z.response_id 
                                   FROM Responses z 
                                  WHERE z.user_id = users.user_id 
                                 ORDER BY total_score DESC 
                                 LIMIT 10)
northpole
That won't work, because the Response candidate rows need to be grouped by question_id before you can determine which are the 10 winners *for that question*.
Confusion
that's what I mean by including the questions. My example will get you this persons top 10 winners, not top 10 by question.
northpole
It will give you the top 10 highest scoring responses for this user. However, it could very well be that none of those are winners, because every question that each of those responses belong to has 10 higher scoring responses by 10 other people.
Confusion
So the key to your idea is that a value from the outer query, for example users.user_id, can be used in the sub-query? Will the sub-query really be regenerated for each different users.user_id? Because I can see how I would do something like:select * from responses where responses.user_id='somedude' and responses.id in (select r.id from responses r where r.question_id=responses.question_id order by total_score DESC LIMIT 10);
MikeJ
@MikeJ: Yes, that's called a *correlated subquery*. The subquery must be regenerated for each row of the outer query, because the result of the subquery could be different for each different user_id. It's still a correlated subquery if you depend on the question_id of the outer query.
Bill Karwin
@mikeJ - yes it will be based off of each user_id unless you provide a specific user_id in the main where clause.@confusion - just add questions to the subselect (and add joins), or join off of the parent responses (so responses.question_id = z.question_id)
northpole
While I love the simplicity or this approach and your educating me about "correlated subqueries" I have one minor gripe, which is that this runs up against the well-known MySQL limitation of LIMIT not being allowed in a subquery. Of course I never mentioned I was using MySQL (my bad). This is a great solution in most databases though.
MikeJ
+1  A: 

Or you can really optimize it by adding another field like "IsTopPost". You would have to update the top posts when someone votes, but your query would be simple:

SELECT * FROM Responses WHERE user_id=? and IsTopPost = 1
GalacticJello
This would work, but you might be thrown on a sword by your DBA doing it this way. In my experience DBAs prefer you to derive values instead of storing them, if possible. For example, don't store the count of rows in a table but rather count them when you need them.
northpole
I agree to an extent, but it also depends on the scale and the access pattern. If there are billions of responses and you needed to calculate the answer often, the cardinality of adding that field would save tons of reads as well as locks, which would make your DBA happy. If this is a one-off report, by all means calculate it on the fly and take the hit.
GalacticJello
GalacticJello, you are absolutely correct here. There are times when this could be okay, and your example is one of them.
northpole
I like the idea but this is the sort of thing I would leave until later as a tweak when I actually have my first billion users :)
MikeJ
+1  A: 

I think something like this should do the trick:

SELECT 
    user_id, question_id, response_id 
FROM 
    Responses AS r1 
WHERE 
    user_id = ?
AND
    response_id IN (SELECT response_id 
                    FROM Responses AS r2 
                    WHERE r2.question_id = r1.question_id 
                    ORDER BY total_score DESC LIMIT 10)

Effectively, for each question_id, a subquery is performed which determines the top 10 responses for that question_id.

You may want to consider adding a column which marks certain Responses as 'winners'. That way, you can simply select those rows and save the database from having to calculate the top 10's over and over again.

Confusion
As I noted on the first response, the only problem is that LIMIT in a subquery will not work in MySQL but works great aside from that.
MikeJ
+2  A: 

If you use Oracle, Microsoft SQL Server, DB2, or PostgreSQL, these databases support windowing functions. Join the user's responses to other responses to the same question. Then partition by question and order by score descending. Use the row number within each partition to restrict the set to those in the top 10. Also pass along the user_id of the given user so you can pick them out of the top 10, since you're only interested in the given user's responses.

SELECT *
FROM (
  SELECT r1.user_id AS given_user, r2.*,
    ROW_NUMBER() OVER (PARTITION BY r2.question_id ORDER BY r2.total_score DESC) AS rownum
  FROM Responses r1 JOIN Responses r2 ON r1.question_id = r2.question_id
  WHERE r1.user_id = ?
) t
WHERE rownum <= 10 AND user_id = given_user;

However, if you use MySQL or SQLite or other databases that don't support windowing functions, you can use this different solution:

Query for the user's responses, and use a join to match other responses to the respective questions with greater score (or earlier PK in the case of ties). Group by question, and count the number of responses that have higher score. If the count is fewer than 10, then the user's response is among the top 10 per question.

SELECT r1.*
FROM Responses r1
LEFT OUTER JOIN Responses r2 ON r1.question_id = r2.question_id 
  AND (r1.total_score < r2.total_score 
    OR r1.total_score = r2.total_score AND r1.response_id > r2.response_id)
WHERE r1.user_id = ?
GROUP BY r1.question_id
HAVING COUNT(*) < 10;
Bill Karwin
This is ridiculously clever Bill... my only question is (and I will try it for myself tomorrow when I have time) will this be faster than a repeated subquery? There's a lot of calculation going on.
MikeJ
The performance varies a lot by brand of database. Each RDBMS is more efficient at different types of queries. And one might be better for your specific collection of data. I don't know what database you're using, so all I can advise is for you to try both methods and see which one you like better.
Bill Karwin
I am writing a Rails app, so I'm mostly using MySQL and SQLite at the moment, but trying to be as database agnostic as possible to start--database-specific optimizations can come later. I really like your second example because it looks like it would work on almost any RDBMS without workarounds.
MikeJ
I should mention that the GROUP BY statement needs to be "GROUP BY r1.response_id". Aside from that, this query is working great for me. I should also note for anyone else reading this that this query is not intended for finding Responses that are the #1 score because both the top Response and the second-best Response will have COUNT(*)==1. In other words HAVING COUNT(*) < 2 will find Responses that are in the top 2 for their question, and using HAVING COUNT(*) < 1 will always return an empty set.
MikeJ