views:

2124

answers:

8

Consider the following 2 queries:

select tblA.a,tblA.b,tblA.c,tblA.d from tblA where tblA.a not in (select tblB.a from tblB)

select tblA.a,tblA.b,tblA.c,tblA.d from tblA left outer join tblB on tblA.a = tblB.a where tblB.a is null

Which will perform better? My assumption is that in general the join will be better except in cases where the subselect returns a very small result set.

A: 

From my observations, MSSQL server produces same query plan for these queries.

aku
A: 

I created a simple query similar to the ones in the question on MSSQL2005 and the explain plans were different. The first query appears to be faster. I am not a SQL expert but the estimated explain plan had 37% for query 1 and 63% for the query 2. It appears that the biggest cost for query 2 is the join. Both queries had two table scans.

Mike Polen
+9  A: 

RDBMSs "rewrite" queries to optimize them, so it depends on system you're using, and I would guess they end up giving the same performance on most "good" databases.

I suggest picking the one that is clearer and easier to maintain, for my money, that's the first one. It's much easier to debug the subquery as it can be run independently to check for sanity.

Tom
+2  A: 

I second Tom's answer that you should pick the one that is easier to understand and maintain.

The query plan of any query in any database cannot be predicted because you haven't given us indexes or data distributions. The only way to predict which is faster is to run them against your database.

As a rule of thumb I tend to use sub-selects when I do not need to include any columns from tblB in my select clause. I would definitely go for a sub-select when I want to use the 'in' predicate (and usually for the 'not in' that you included in the question), for the simple reason that these are easier to understand when you or someone else has come back and change them.

andy47
A: 

The first query will be faster in SQL Server which I think is slighty counter intuitive - Sub queries seem like they should be slower. In some cases (as data volumes increase) an exists may be faster than an in.

Martynnw
+3  A: 

non-correlated sub queries are fine. you should go with what describes the data you're wanting. as has been noted, this likely gets rewritten into the same plan, but isn't guaranteed to! what's more, if table A and B are not 1:1 you will get duplicate tuples from the join query (as the IN clause performs an implicit DISTINCT sort), so it's always best to code what you want and actually think about the outcome.

Andy Irving
A: 

It should be noted that these queries will produce different results if TblB.a is not unique.

David B
+2  A: 

Well, it depends on the datasets. From my experience, if You have small dataset then go for a NOT IN if it's large go for a LEFT JOIN. The NOT IN clause seems to be very slow on large datasets.

One other thing I might add is that the explain plans might be misleading. I've seen several queries where explain was sky high and the query run under 1s. On the other hand I've seen queries with excellent explain plan and they could run for hours.

So all in all do test on your data and see for yourself.

Piotr Anders