Basically I'm trying to pull a random poll question that a user has not yet responded to from a database. This query takes about 10-20 seconds to execute, which is obviously no good! The responses table is about 30K rows and the database also has about 300 questions.

SELECT questions.id
FROM questions
LEFT JOIN responses
    ON questions.id = responses.questionID
    AND responses.username = 'someuser'
WHERE responses.username IS NULL
ORDER BY RAND() ASC
LIMIT 1

The PK for the questions and responses tables is 'id', if that matters.

Any advice would be greatly appreciated.

+3  A: 

The problem is probably not the join; it's almost certainly sorting 30k rows with ORDER BY RAND().

Allyn
Thanks for the advice. I'll look into the ORDER BY RAND() issue. Indexing the questionID and username columns did the trick. Hopefully doing the randomizing outside the query will make it even faster.
Brian
ORDER BY is a necessary evil - there literally *isn't* any alternative to ordering your results. But a common mistake is using ORDER BY in both normal and inline views - unless you have a really good reason, **only** define ORDER BY for the outermost query.
OMG Ponies
+3  A: 

See: Do not order by rand

He suggests the following (replace the quotes table in this example with your own query):

SELECT COUNT(*) AS cnt FROM quotes

-- generate random number between 0 and cnt-1 in your programming language and run 
-- the query:

SELECT quote FROM quotes LIMIT $generated_number, 1

Of course you could probably make the first statement a subselect inside the second.
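
As a rough sketch of how the count-then-offset idea could be applied to the unanswered-question case, here is a small Python example. It makes several assumptions that are not from the linked article: SQLite (via the standard sqlite3 module) stands in for MySQL, the table contents are made up, and random_unanswered_question is a hypothetical helper name. The filter is written as a NOT IN subquery and reused for both the count and the pick.

```python
import random
import sqlite3

# Made-up miniature schema; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY);
    CREATE TABLE responses (id INTEGER PRIMARY KEY, questionID INTEGER, username TEXT);
    INSERT INTO questions (id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO responses (questionID, username) VALUES
        (1, 'someuser'), (2, 'someuser'), (3, 'otheruser');
""")

# Shared FROM/WHERE part: questions the given user has not answered.
UNANSWERED = """
    FROM questions q
    WHERE q.id NOT IN (
        SELECT r.questionID FROM responses r WHERE r.username = ?
    )
"""

def random_unanswered_question(conn, username):
    """Return the id of a random unanswered question, or None if none are left."""
    (cnt,) = conn.execute("SELECT COUNT(*) " + UNANSWERED, (username,)).fetchone()
    if cnt == 0:
        return None
    offset = random.randrange(cnt)  # random integer in 0 .. cnt-1
    row = conn.execute(
        "SELECT q.id " + UNANSWERED + " ORDER BY q.id LIMIT 1 OFFSET ?",
        (username, offset),
    ).fetchone()
    return row[0]

picked = random_unanswered_question(conn, "someuser")
print(picked)
```

The ORDER BY q.id is only there to make the offset refer to a stable ordering; it sorts on the primary key rather than on a random value. In MySQL the last query would read LIMIT offset, 1 as in the example above.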

Tom Leys
I doubt that randomly ordering 300 questions is that slow.
ChaosPandion
Randomly ordering a 300-row table joined against a 30,000-row table very well might be, though.
BipedalShark
Well, it never really joins anything. That's the point, right?
Brian
+4  A: 

You most likely need an index on

responses.questionID
responses.username

Without the index searching through 30k rows will always be slow.
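
A minimal sketch of adding such an index and checking that the query plan picks it up. Assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL, the index name idx_responses_user_question is made up, and a single composite index on both columns is used rather than two separate ones. In MySQL the CREATE INDEX statement has the same shape, and EXPLAIN takes the place of EXPLAIN QUERY PLAN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY);
    CREATE TABLE responses (id INTEGER PRIMARY KEY, questionID INTEGER, username TEXT);
    -- Hypothetical index name; one composite index covers both join conditions.
    CREATE INDEX idx_responses_user_question ON responses (username, questionID);
""")

query = """
    SELECT questions.id
    FROM questions
    LEFT JOIN responses
           ON questions.id = responses.questionID
          AND responses.username = ?
    WHERE responses.username IS NULL
"""

# The last column of each plan row is a human-readable description of one step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("someuser",))]
uses_index = any("idx_responses_user_question" in step for step in plan)

for step in plan:
    print(step)
print("index used:", uses_index)
```

The exact wording of the plan differs between database engines and versions, so the check only looks for the index name in the plan text.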

ChaosPandion
This fixed my main problem. Looks like I need to address "order by rand()" as well. Thanks!
Brian
@Chaos: I responded to your comment in nickf's answer.
OMG Ponies
+3  A: 

Here's a different approach to the query which might be faster:

SELECT q.id
FROM questions q
WHERE q.id NOT IN (
    SELECT r.questionID
    FROM responses r
    WHERE r.username = 'someuser'
)

Make sure there is an index on r.username and that should be pretty quick.

The above will return all the unanswered questions. To choose a random one, you could go with the inefficient (but easy) ORDER BY RAND() LIMIT 1, or use the method suggested by Tom Leys.
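
To compare this form with the alternatives that come up in this thread, here is a small sketch that runs NOT IN, NOT EXISTS and LEFT JOIN / IS NULL against the same made-up data. SQLite (via Python's sqlite3) is assumed as a stand-in for MySQL, so it says nothing about relative speed on MySQL; it only checks that the three forms select the same questions.

```python
import sqlite3

# SQLite stands in for MySQL; the tiny data set is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY);
    CREATE TABLE responses (id INTEGER PRIMARY KEY, questionID INTEGER, username TEXT);
    INSERT INTO questions (id) VALUES (1), (2), (3), (4);
    INSERT INTO responses (questionID, username) VALUES
        (1, 'someuser'), (2, 'otheruser'), (3, 'someuser');
""")

queries = {
    "NOT IN": """
        SELECT q.id FROM questions q
        WHERE q.id NOT IN (
            SELECT r.questionID FROM responses r WHERE r.username = ?
        )
        ORDER BY q.id
    """,
    "NOT EXISTS": """
        SELECT q.id FROM questions q
        WHERE NOT EXISTS (
            SELECT 1 FROM responses r
            WHERE r.questionID = q.id AND r.username = ?
        )
        ORDER BY q.id
    """,
    "LEFT JOIN / IS NULL": """
        SELECT q.id FROM questions q
        LEFT JOIN responses r
               ON r.questionID = q.id AND r.username = ?
        WHERE r.username IS NULL
        ORDER BY q.id
    """,
}

# Run each form for the same user and collect the returned question ids.
results = {
    name: [row[0] for row in conn.execute(sql, ("someuser",))]
    for name, sql in queries.items()
}
for name, ids in results.items():
    print(name, ids)
```

One caveat worth keeping in mind: if responses.questionID can contain NULL for the user in question, the NOT IN form stops returning rows, while the other two forms are unaffected.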

nickf
Are you really saying this would be faster than a join?
ChaosPandion
Yeah, I've read that subqueries are almost always worse than joins.
Brian
OMG Ponies
@rexem: Interesting, in that case I would use the Left Join because I think it looks better.
ChaosPandion
@Chaos: really? I find the `NOT IN` to be far more descriptive of what you're actually trying to do. Perhaps it's just me, but I can read the `NOT IN` version like a sentence "get the id of questions where 'someuser' hasn't responded to it". I couldn't do the same with the LEFT JOIN.
nickf
@Chaos: Looking for NULLs in a LEFT JOIN reads as making the JOIN and then excluding those rows. MySQL is the **only** DB I'm aware of where LEFT JOIN/IS NULL is even as fast as NOT IN or NOT EXISTS (both of which are more readable). Be careful of thinking your query is optimized when it really isn't.
OMG Ponies
This is dumb; the subquery must execute once for each row in questions. A left join is far superior.
Shawn Simon
@unknown: That is only the case when a SELECT is used **within** the SELECT clause. Read the link, and mind what I said about how the LEFT JOIN/IS NULL is **only** valid for MySQL...
OMG Ponies
never knew that!
Shawn Simon
A: 

Is OP even sure the original query returns the correct result set?

I assume the "AND responses.username = 'someuser'" clause was added to the join specification with the intention that the join will then generate NULL right-side columns only for the ids that someuser has not answered.

My question: won't that join generate NULL right-side columns for every questions.id that has not been answered by all users? The left join works such that, "If any row from the target table does not match the join expression, then NULL values are generated for all column references to the target table in the SELECT column list."

In any case, nickf's suggestion looks good to me.
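
One way to check this empirically is to look at the raw rows the join produces, before the IS NULL filter is applied. The sketch below is only an illustration under assumptions: SQLite (through Python's sqlite3) stands in for MySQL, and the three questions and their responses are made up so that one is answered by someuser, one only by another user, and one by nobody.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY);
    CREATE TABLE responses (id INTEGER PRIMARY KEY, questionID INTEGER, username TEXT);
    INSERT INTO questions (id) VALUES (1), (2), (3);
    -- Question 1: answered by someuser (and by otheruser).
    -- Question 2: answered only by otheruser.
    -- Question 3: answered by nobody.
    INSERT INTO responses (questionID, username) VALUES
        (1, 'someuser'), (1, 'otheruser'), (2, 'otheruser');
""")

# The join from the original question, without the WHERE ... IS NULL filter.
joined = conn.execute("""
    SELECT questions.id, responses.username
    FROM questions
    LEFT JOIN responses
           ON questions.id = responses.questionID
          AND responses.username = ?
    ORDER BY questions.id
""", ("someuser",)).fetchall()

# Rows whose right-hand side is NULL are what the IS NULL filter would keep.
unanswered = [qid for qid, username in joined if username is None]

print(joined)
print(unanswered)
```

On this data the right-hand side is NULL for questions 2 and 3 and filled in for question 1, so rows from other users do not count as matches once the username condition is part of the join specification.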

Herbert Sitz
@Herbert: You've hit the exact reason why I cringe when I read LEFT JOIN/IS NULL evaluations :) But MySQL is the only DB I'm aware of that treats it as an equivalent to NOT IN: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/
OMG Ponies
@rexem -- Oh, thanks. I know that including those non-equijoin conditions in the left join spec causes problems in ANSI SQL, because the operation is different from what users expect. Are you saying MySQL has avoided that problem by abandoning the ANSI SQL specs for left joins?
Herbert Sitz
@Herbert: All I know is the MySQL optimizer handles it that way. Every DB has its quirks...
OMG Ponies
@rexem -- Maybe we're talking about different things. The issue I was pointing out had nothing to do with optimization. It has to do with users misunderstanding SQL and how adding a separate condition in the join clause works. The results mandated by the ANSI SQL spec are nonintuitive, different from what many users expect. Unless OP has confirmed that the results of the original query were correct, I suspect that the join is causing a problem not just with optimization, but with the accuracy of the result set itself.
Herbert Sitz