views:

594

answers:

7

After prepairing an answer for this question I found I couldn't verify my answer.

In my first programming job I was told that a query within the IN () predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.

For example, given the query:

SELECT count(*) FROM Table1 WHERE Table1Id NOT IN (
SELECT Table1Id FROM Table2 WHERE id_user = 1)
Table1 Rows | # of "IN" executions
----------------------------------
      10    |       10
     100    |      100
    1000    |     1000
   10000    |    10000

Is this correct? How does the IN predicate actually work?

A: 

Not really. But it's butter to write such queries using JOIN

Koistya Navin
+4  A: 

Depends on optimizer. Check exact query plan for each particular query to see how the RDBMS will actually execute that.

In Oracle that'd be:

EXPLAIN PLAN FOR «your query»

In MySQL or PostgreSQL

EXPLAIN «your query»
vartec
@vartec - any resources on reading query plans? This is a blind side of my knowledge and I've yet to find a decent article online.
Gavin Miller
it's DBMS dependent, but basically that gives you idea what basic operation it will do and in which order, which index will it use etc.
vartec
I think that what you'd have to look for, would be article about DB tuning, as this is a purpose of looking at query plans in the first place.
vartec
See this SO question: http://stackoverflow.com/questions/79266/how-do-you-interpret-a-querys-explain-plan
Nathan Koop
I added a new question: http://stackoverflow.com/questions/761204/list-of-resources-for-database-tuning
Gavin Miller
good question actually, I've never though about how do I know about this stuff. Guess it came with experience ;-)
vartec
+7  A: 

It will entirely depend on the database you're using, and the exact query.

Query optimisers are very smart at times - in your sample query, I'd expect the better databases to be able to use the same sort of techniques that they do with a join. More naive databases may just execute the same query many times.

Jon Skeet
I agree, I just did a test query and I believe (I'm not great at reading ExecutionPlans) it created an inner join.
Nathan Koop
+3  A: 

Most SQL engines nowadays will almost always create the same execution plan for LEFT JOIN, NOT IN and NOT EXISTS

I would say look at your execution plan and find out :-)

Also if you have NULL values for the Table1Id column you will not get any data back

SQLMenace
+3  A: 

This depends on the RDBMS in question.

See detailed analysis here:

In short:

  1. MySQL will optimize the query to this:

    SELECT  COUNT(*)
    FROM    Table1 t1
    WHERE   NOT EXISTS
            (
            SELECT  1
            FROM    Table2 t2
            WHERE   t2.id_user = 1
                    AND t2.Table1ID = t1.Table2ID
            )
    

    and run the inner subquery in a loop, using the index lookup each time.

  2. SQL Server will use MERGE ANTI JOIN.

    The inner subquery will not be "executed" in a common sense of word, instead, the results from both query and subquery will be fetched concurrently.

    See the link above for detailed explanation.

  3. Oracle will use HASH ANTI JOIN.

    The inner subquery will be executed once, and a hash table will be built from the resultset.

    The values from the outer query will be looked up in the hash table.

  4. PostgreSQL will use NOT (HASHED SUBPLAN).

    Much like Oracle.

Note that rewriting the query as this:

SELECT  (
        SELECT  COUNT(*)
        FROM    Table1
        ) - 
        (
        SELECT  COUNT(*)
        FROM    Table2 t2
        WHERE   (t2.id_user, t2.Table1ID) IN
                (
                SELECT  1, Table1ID
                FROM    Table1
                )
        )

will greatly improve the performance in all four systems.

Quassnoi
Last time I looked, SQL Server only allows a single column to be returned from a subquery that is then used in an IN predicate. Has that changed?
RolandTumble
@RolandTumble: no. `MySQL` and `PostgreSQL` allow that, `SQL Server` and `Oracle` don't. For the latter two systems, the `IN` predicate should be rewritten as `EXISTS`.
Quassnoi
A: 

Yes, but execution stops as soon as the query processer "finds" the value you are looking for... So if, for example the first row in the outer select has Table1Id = 32, and if Table2 has a record with a TableId = 32, then as soon as the subquery finds the row in Table2 where TableId = 32, it stops...

Charles Bretana
+12  A: 

The warning you got about subqueries executing for each row is true -- for correlated subqueries.

SELECT COUNT(*) FROM Table1 a 
WHERE a.Table1id NOT IN (
  SELECT b.Table1Id FROM Table2 b WHERE b.id_user = a.id_user
);

Note that the subquery references the id_user column of the outer query. The value of id_user on each row of Table1 may be different. So the subquery's result will likely be different, depending on the current row in the outer query. The RDBMS must execute the subquery many times, once for each row in the outer query.

The example you tested is a non-correlated subquery. Most modern RDBMS optimizers worth their salt should be able to tell when the subquery's result doesn't depend on the values in each row of the outer query. In that case, the RDBMS runs the subquery a single time, caches its result, and uses it repeatedly for the predicate in the outer query.

PS: In SQL, IN() is called a "predicate," not a statement. A predicate is a part of the language that evaluates to either true or false, but cannot necessarily be executed independently as a statement. That is, you can't just run this as an SQL query: "2 IN (1,2,3);" Although this is a valid predicate, it's not a valid statement.

Bill Karwin
+1 Great answer; Also thanks for the terminology lesson!
Gavin Miller
In MySQL, the subquery will be optimized to NOT EXISTS (which is DEPENDENT SUBQUERY), and will use index access on `INDEX ON Table2 (id_user, table1ID)` if it exists.
Quassnoi