views:

343

answers:

4

Given the following table structure:

CREATE TABLE user (
   uid INT(11) auto_increment,
   name VARCHAR(200),
   PRIMARY KEY(uid)
);
CREATE TABLE user_profile(
   uid INT(11),
   address VARCHAR(200),
   PRIMARY KEY(uid),
   INDEX(address)
);

Which join query is more efficient: #1,

SELECT u.name FROM user u INNER JOIN user_profile p ON u.uid = p.uid WHERE p.address = 'some constant'

or #2:

SELECT u.name FROM user u INNER JOIN (SELECT uid FROM user_profile WHERE p.address = 'some constant') p ON u.uid = p.uid

How much is the difference in efficiency?

+5  A: 

The first syntax is generally more efficient.

MySQL buffers the derived queries so using the derived query robs the user_profile of possibility to be a driven table in the join.

Even if the user_profile is leading, the subquery results should be buffered first which implies a memory and performance impact.

A LIMIT applied to the queries will make the first query much faster which is not true for the second one.

Here are the sample plans. There is an index on (val, nid) in the table t_source:

First query:

EXPLAIN
SELECT  *
FROM    t_source s1
JOIN    t_source s2
ON      s2.nid = s1.id
WHERE   s2.val = 1

1, 'SIMPLE', 's1', 'ALL', 'PRIMARY', '', '', '', 1000000, ''
1, 'SIMPLE', 's2', 'ref', 'ix_source_val,ix_source_val_nid,ix_source_vald_nid', 'ix_source_val_nid', '8', 'const,test.s1.id', 1, 'Using where'

Second query:

EXPLAIN
SELECT  *
FROM    t_source s1
JOIN    (
        SELECT  nid
        FROM    t_source s2
        WHERE   val = 1
        ) q
ON      q.nid = s1.id

1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 100000, ''
1, 'PRIMARY', 's1', 'ref', 'PRIMARY', 'PRIMARY', '4', 'q.nid', 10000, 'Using where'
2, 'DERIVED', 's2', 'ref', 'ix_source_val,ix_source_val_nid,ix_source_vald_nid', 'ix_source_vald_nid', '4', '', 91324, 'Using index'

As you can see, only a part of the index is used in the second case, and q is forced to be leading.

Update:

Derived queries (which is what this question concerns) are not to be confused with the subqueries.

While MySQL is not able to optimize derived queries (those used in the FROM clause), the subqueries (those used with IN or EXISTS) are treated much better.

See these articles in my blog for more detail:

Quassnoi
+1 for a detailed answer, and thank you for the heads-up on the ridiculous failings of the MySQL optimizer; I've had issues with it before, but nothing quite that poor.
James McNellis
A: 

Not sure how the MySQL's query engine would handle that, but my assumption would be the first query would perform better and be more efficient.

The first query is also more standard and the easier to read of the two therefore more preferable.

Justin Niessner
A: 

The answer usually depends on the statistics gathered by database. The first form seems to be easier for optimizer.

As far as I remember, MySQL doesn't work well with IN... queries and subselects

Dmitry
The first form is better indeed, but `IN` is a special case in `MySQL` and it is optimized quite well. You may want to read this article: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/
Quassnoi
+1  A: 

Looking at the explain queries for these selects, we get this: (row headers are id, select_type, table, type, possible_keys, key, key_len, ref, rows, extra)

1   SIMPLE u system PRIMARY NULL NULL NULL 1 
1   SIMPLE p const PRIMARY,address PRIMARY 4 const 1

And the EXPLAIN for the second...

1   PRIMARY u system PRIMARY NULL NULL NULL 1 
1   PRIMARY <derived2> system NULL NULL NULL NULL 1  
2   DERIVED p ref address address 201  1 Using where

So, the first query is simpler, and simpler is usually more efficient.

However, from your CREATEs, it would be vastly more efficient to add the address field to the user table. Since profile is 1-to-1 with the user table (on uid), it is possible to combine the tables and still keep the schema normalized.

Then, your query would be

SELECT u.name FROM user u WHERE u.address = 'some constant'

and the explain shows

1   SIMPLE u ref address address 201 const 1 Using where, using filesort

Oddly, the simplified schema uses filesorting, which is bad if you have lots of rows.

More on explain: http://dev.mysql.com/doc/refman/5.0/en/explain.html

davethegr8
I've never seen the filesort in such a query in `MySQL`, but in other systems this is used to speedup the index lookup. It takes the row pointers from the index and sorts them so that the row lookups will be sequential. The benefits of sequential access to the table outweight the cost to sort the index results.
Quassnoi
I was pretty surprised that it did a filesort as well, but it can be a performance hog if you have a large table. If you have 1000 results or so, it'll take some time to write that out to a file, sort the rows in the file, and then proceed. Interesting that it's a *optimization* however. I'm not very familiar with the internals of DBs.
davethegr8
In this case `filesort` does not write to a file. It can write to a file indeed (which would be reflected in the plan as `using temporary`), but it does not in this case. `Filesort` is improperly named, it should be just a `sort`.
Quassnoi
I should say though that I'm quite surprised to see it here anyway, since this is a const scan and the row pointers are already ordered within a single index key. More than that, it should have been optimized out since you only have `1` row in your table (if your plan is correct and the statistics are up to date). Which version of `MySQL` are you using?
Quassnoi
I actually had two rows at that point. Apparently I have 4.1.20 on my localhost. Man... I need to upgrade. Thanks for the details about filesort, it's good to know.
davethegr8