I'd like to select all rows from one table which match "one or more" rows in another table, in the most efficient way.

SELECT identity.id FROM identity
INNER JOIN task ON
  task.identityid=identity.id
  AND task.groupid IN (78, 122, 345, 12, 234, 778, 233, 123, 33)

Currently, if there are multiple matching tasks, this returns the same identity multiple times (though the performance penalty of eliminating these later is not too bad). I'd like it to return only one row for each identity that matches one or more of these task groups, and I was wondering whether there is a more efficient way than DISTINCT or GROUP BY.

The trouble with DISTINCT or GROUP BY is that the task table is still scanned for all groupid matches, and these are only later reduced to one per identity by way of a temporary table (sometimes with filesort). I would rather it do some sort of short-circuit evaluation: once it has found one matching task for an identity, it should not pursue any further task matches for the same identity.

I was thinking of doing an EXISTS subquery, but I don't know how these are optimised. I'd need for it to join the task table first, before the identity table, so I am not doing a full scan of the identity table which is very large and will have a lot of non-matches.

A: 

Does MySQL support the TOP N syntax? If so:

SELECT TOP 1 identity.id FROM identity
INNER JOIN task ON
  task.identityid=identity.id
  AND task.groupid IN (78, 122, 345, 12, 234, 778, 233, 123, 33)
1800 INFORMATION
The MySQL syntax (instead of TOP 1 right after SELECT) would be to add ORDER BY identity.id DESC LIMIT 1 at the end of the query, but either TOP or LIMIT produces a single-row answer, which seems quite different from what the question requests.
Alex Martelli
+1  A: 

Just using "SELECT DISTINCT" with what you have should be efficient in MySQL. You may need to put your values in a table and join to it, rather than using "IN ( ... )".
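
A minimal sketch of the "put your values in a table and join to it" idea, using the group IDs from the question. The temporary table name wanted_group is hypothetical, and whether this beats the IN list depends on your indexes and data, so compare both with EXPLAIN:

```sql
-- Hypothetical helper table holding the group IDs from the question.
CREATE TEMPORARY TABLE wanted_group (
  groupid INT NOT NULL PRIMARY KEY
);

INSERT INTO wanted_group (groupid)
VALUES (78), (122), (345), (12), (234), (778), (233), (123), (33);

-- Same result as the original query, with duplicates removed.
SELECT DISTINCT identity.id
FROM identity
INNER JOIN task
  ON task.identityid = identity.id
INNER JOIN wanted_group
  ON wanted_group.groupid = task.groupid;
```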

le dorfier
When I use 'DISTINCT' it shows 'Using temporary table'. It still seems pretty fast for my simplified tests, but doesn't that add a fair amount of overhead which could catch up with me? Are temporary tables for DISTINCT ever fast/in-memory?
thomasrutter
Note the change. MySQL does like to do temp tables, but usually fairly efficiently. The WHERE EXISTS strategy is usually the most frequent cross-server recommendation, and should also work. (WHERE ... IN ( ... ) just makes me shudder - it usually means an automatic UNION.)
le dorfier
A: 

EXISTS should perform just fine for you, as long as the column you are comparing in the subquery is indexed.

I would expect EXISTS to perform just a little better than a join-and-group-by, but I would have to try it out to be sure. I've run across enough performance stuff in MySQL where my prediction was wrong to know it's worth giving it a try.
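
For reference, a sketch of what the EXISTS form of the question's query might look like. It assumes an index covering task.identityid (ideally together with groupid); that index is an assumption, not something stated in the question:

```sql
-- Returns each matching identity at most once, because EXISTS only
-- tests whether at least one matching task row is present.
SELECT identity.id
FROM identity
WHERE EXISTS (
  SELECT 1
  FROM task
  WHERE task.identityid = identity.id
    AND task.groupid IN (78, 122, 345, 12, 234, 778, 233, 123, 33)
);
```

Prefix the statement with EXPLAIN to see which table the optimizer actually reads first on your data.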

MBCook
I gave it a try, and EXPLAIN showed that it joined the identity table before the task table, so it executed the subquery for every row in the identity table. This isn't the order I want, but it's hard to say whether this is just because the test data I have is so small - maybe it would join it the other way with lots more identities. I'll have to test with a large amount of data to find out!
thomasrutter
I guess I may also have misunderstood how EXPLAIN shows join order for subqueries...
thomasrutter