views:

117

answers:

5

I have a table files with files and a table reades with read accesses to these files. In the table reades there is a column file_id where refers to the respective column in files.

Now I would like to list all files which have not been accessed and tried this:

SELECT * FROM files WHERE file_id NOT IN (SELECT file_id FROM reades)

This is terribly slow. The reason is that mySQL thinks that the subquery is dependent on the query:

+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type        | table  | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | PRIMARY            | files  | ALL  | NULL          | NULL | NULL    | NULL | 1053 |   100.00 | Using where |
|  2 | DEPENDENT SUBQUERY | reades | ALL  | NULL          | NULL | NULL    | NULL | 3242 |   100.00 | Using where |
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+

But why? The subquery is completely independent and more or less just meant to return a list of ids.

(To be precise: Each file_id can appear multiple times in reades, of course, as there can be arbitrarily many read operations for each file.)

+2  A: 

Try:

SELECT * FROM files WHERE file_id NOT IN (SELECT reades.file_id FROM reades)

That is: if it's coming up as dependent, perhaps that's because of ambiguity in what file_id refers to, so let's try fully qualifying it.

If that doesn't work, just do:

SELECT files.*
FROM files
LEFT JOIN reades
USING (file_id)
WHERE reades.file_id IS NULL
chaos
Okay. Yeah, I have no idea why that would be.
chaos
The first suggestion is also recognized as a dependent subquery. The second works, but is slow.
fuenfundachtzig
Well, it appears to me that the MySQL parser is wilfully misunderstanding the query - or the note 'DEPENDENT SUBQUERY' means something different from 'correlated sub-query', which is the term normally used for a sub-query that depends on a 'current row value' from the main query.
Jonathan Leffler
Is `file_id` indexed on both tables?
chaos
No. If I do index it, your solution is also quick.
fuenfundachtzig
+2  A: 

Try replacing the subquery with a join:

SELECT * 
FROM files f
LEFT OUTER JOIN reades r on r.file_id = f.file_id
WHERE r.file_id IS NULL

Here's a link to an article about this problem. The writer of that article wrote a stored procedure to force MySQL to evaluate subqueries as independant. I doubt that's necessary in this case though.

Andomar
Hm. EXPLAIN identifies no correlated queries, but it's still very slow.
fuenfundachtzig
This should be fast on any decent machine. Consider adding an index on reades.file_id, or making it a foreign key.
Andomar
I added a foreign key (ALTER TABLE reades ADD FOREIGN KEY (file_id) REFERENCES files (file_id);) and it's much faster now. (Though I thought mySQL did not support foreign keys?!)
fuenfundachtzig
Creating a foreign key implies creating an index, which speeds things up considerably.
Andomar
+2  A: 

i've seen this before. it's a bug in mysql. try this:

SELECT * FROM files WHERE file_id NOT IN (SELECT * FROM (SELECT file_id FROM reades))

there bug report is here: http://bugs.mysql.com/bug.php?id=25926

longneck
found it: http://bugs.mysql.com/bug.php?id=25926
longneck
+1 Interesting link; you could edit your question, instead of adding the link as a comment.
Andomar
A: 

Does MySQL support EXISTS in the same way that MSSQL would? If so, you could rewrite the query as

SELECT * FROM files as f WHERE file_id NOT EXISTS (SELECT 1 FROM reades r WHERE r.file_id = f.file_id)

Using IN is horribly inefficient as it runs that subquery for each row in the parent query.

Chad
A: 

Looking at this page I found two possible solutions which both work. Just for completeness I add one of those, similar to the answers with JOINs shown above, but it is fast even without using foreign keys:

  SELECT * FROM files AS f 
    INNER JOIN (SELECT DISTINCT file_id FROM reades) AS r 
    ON f.file_id = r.file_id

This solves the problem, but still this does not answer my question :)

EDIT: If I interpret the EXPLAIN output correctly, this is fast, because the interpreter generates a temporary index:

+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref       | rows | Extra                    |
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL    | NULL    | NULL      |  843 |                          |
|  1 | PRIMARY     | f          | eq_ref | PRIMARY       | PRIMARY | 4       | r.file_id |    1 |                          |
|  2 | DERIVED     | reades     | range  | NULL          | file_id | 5       | NULL      |  811 | Using index for group-by |
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
fuenfundachtzig
If you have to run DISTINCT for your `READES.file_id`, that's information we wouldn't be aware of.
OMG Ponies
Does "MySQL has a sloppy query optimizer" answer your question? :P
Andomar
I wouldn't have imagined that indexing a column has such an impact in this kind of queries... I thought mySQL would recognize the subquery as constant, but obviously one has to be careful with such assumptions.
fuenfundachtzig