I'm trying to figure out which is faster: a clause like "WHERE IN (SELECT 1 FROM MyTable)" or a clause like "WHERE EXISTS (SELECT 1 FROM MyTable)".

Let's use the query from the SQL Server documentation:

SELECT * FROM Orders 
WHERE ShipRegion = 'WA' AND EXISTS (
    SELECT EmployeeID FROM Employees AS Emp 
    WHERE Emp.EmployeeID = Orders.EmployeeID)

Or

SELECT * FROM Orders
WHERE ShipRegion = 'WA' AND EmployeeID IN (
    SELECT EmployeeID FROM Employees AS Emp 
    WHERE Emp.EmployeeID = Orders.EmployeeID)

I'd like to know the answer, if anyone has it, but I'd really like to know how to test it for myself in SQL Server 2005. (I'm a noob at SQL Server.)

Thanks!

+1  A: 

You could also remove the WHERE clause in the IN case:

SELECT * FROM Orders
WHERE ShipRegion = 'WA' AND EmployeeID IN (SELECT EmployeeID FROM Employees)

The query optimizer should be able to generate an identical execution plan for both queries. I'd choose the one that's more readable.
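
One way to check this on your own server, as a sketch rather than a definitive recipe: SET SHOWPLAN_TEXT ON makes SQL Server return the estimated plan for each statement as text instead of executing it, so the two plans can be compared line by line. This assumes the Orders and Employees tables from the question exist in the current database, and that the script is run from a tool that understands the GO batch separator, such as Management Studio or sqlcmd.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch, hence the GO lines.
SET SHOWPLAN_TEXT ON;
GO

-- Variant 1: correlated EXISTS (from the question).
SELECT * FROM Orders
WHERE ShipRegion = 'WA' AND EXISTS (
    SELECT EmployeeID FROM Employees AS Emp
    WHERE Emp.EmployeeID = Orders.EmployeeID);

-- Variant 2: uncorrelated IN (from this answer).
SELECT * FROM Orders
WHERE ShipRegion = 'WA' AND EmployeeID IN (SELECT EmployeeID FROM Employees);
GO

-- Switch plan-only mode off again so later statements actually execute.
SET SHOWPLAN_TEXT OFF;
GO
```

If the two plan listings match operator for operator, the optimizer treated both variants as the same query for your data and indexes.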

Mehrdad Afshari
Actually, maybe this is not a good example. A foreign key constraint from Orders into Employees would remove the need for the test.
John Saunders
@John: Indeed. Of course, I made no assumptions about that.
Mehrdad Afshari
Hmm... I guess this goes back to whether IN() stops when it finds the first match. If it does, that would suggest there's no difference between my two examples. OTOH, if it doesn't, then even without the where clause, EXISTS() will be faster, no? Thanks for the response.
EoRaptor013
Why should it *not* stop after the first match? Theoretically, a good query optimizer should execute all of them efficiently. In practice, some DB engines might be better at optimizing one variant.
Mehrdad Afshari
Good points. His queries are not equal, so any discussion about which one is more efficient is comparing apples to pears. Though I think the results should be logically equivalent, I don't expect the optimizer to be very good at removing these unnecessary conditions, because they should be rather uncommon.
erikkallen
+1  A: 

To see for yourself, compare the real execution costs. Run

SET STATISTICS IO ON
SET STATISTICS TIME ON

then run both queries.

Also compare the execution plans: highlight both queries and press Ctrl+L to display their estimated plans. Most likely you will see identical plans.
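
Putting the steps above together, a minimal script might look like the following. It is only a sketch and assumes the Orders and Employees tables from the question exist in the current database.

```sql
-- Report reads per table and CPU/elapsed time per statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Query 1: EXISTS
SELECT * FROM Orders
WHERE ShipRegion = 'WA' AND EXISTS (
    SELECT EmployeeID FROM Employees AS Emp
    WHERE Emp.EmployeeID = Orders.EmployeeID);

-- Query 2: IN
SELECT * FROM Orders
WHERE ShipRegion = 'WA' AND EmployeeID IN (
    SELECT EmployeeID FROM Employees AS Emp
    WHERE Emp.EmployeeID = Orders.EmployeeID);

-- Turn the reporting off again for the rest of the session.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

In Management Studio the figures show up on the Messages tab. It is worth running each query more than once, since the first run may include compilation time and reads from disk rather than from cache.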

AlexKuznetsov
Thank you! This was the other part of the question I was looking for. I knew there had to be ways to measure this stuff, but I didn't know how. Now I do!
EoRaptor013
+1  A: 

The subqueries, although identical, will not give you the answer you are looking for, because they are correlated and could be rewritten as a JOIN.

In general, EXISTS() should be quicker, because it can return a result as soon as it has found the first matching row, whereas IN() still has to look for subsequent rows until it has finished.

Therefore

SELECT * FROM Orders 
WHERE ShipRegion = 'WA' AND EXISTS (
    SELECT 'x' FROM Employees AS Emp 
    WHERE Emp.EmployeeID = 42)

should finish before

SELECT * FROM Orders
WHERE ShipRegion = 'WA' AND EmployeeID IN (
    SELECT EmployeeID FROM Employees AS Emp 
    WHERE Emp.EmployeeID = 42)
Does IN need to continue after it has found a match? Clearly NOT IN would need to, but why IN?
John Saunders
Thanks for the analysis! This is most of what I was looking for, particularly the theoretical or technical part (Exists stops after first hit). You'd think the MSDN books online would mention this.
EoRaptor013
Oh, and my examples aren't necessarily good ones. I couldn't come up with something realistic off the top of my head, so I just cribbed from MS's sample code for EXISTS().
EoRaptor013
Correct, John, my bad. Because the clause could be reduced to "AND Orders.EmployeeID = 42", the most we could expect is one row, based on the primary key. If the subquery were changed to "AS Emp WHERE Emp.Name = 'Smith'", my answer would make more sense.
NO, this answer is wrong. 1) When you say that "IN still has to find subsequent rows...", you assume that the database will perform a loop semijoin, but for large tables (where performance matters) it would probably use a hash join instead. 2) Both IN and EXISTS will likely stop at the first found row IF the query is performed as a loop join and the IN predicate can be guaranteed never to be null (otherwise the semantics are different).
erikkallen
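
To illustrate the remark about NULL semantics, here is a small self-contained sketch using table variables. The names and rows are made up for illustration and are not taken from the question. Inside a WHERE clause, plain IN and EXISTS return the same rows here, so the example uses the negated forms, where the difference becomes visible. Run it as a single batch, because table variables do not survive a GO.

```sql
-- Hypothetical sample data: one employee id is NULL.
DECLARE @Employees TABLE (EmployeeID int NULL);
DECLARE @Orders TABLE (OrderID int NOT NULL, EmployeeID int NULL);

INSERT INTO @Employees VALUES (1);
INSERT INTO @Employees VALUES (NULL);

INSERT INTO @Orders VALUES (1, 1);  -- has a matching employee
INSERT INTO @Orders VALUES (2, 2);  -- has no matching employee

-- Returns no rows: for order 2, "2 NOT IN (1, NULL)" evaluates to UNKNOWN,
-- and WHERE keeps only rows whose predicate is TRUE.
SELECT o.OrderID
FROM @Orders AS o
WHERE o.EmployeeID NOT IN (SELECT EmployeeID FROM @Employees);

-- Returns OrderID 2: NOT EXISTS only asks whether the subquery produced a row.
SELECT o.OrderID
FROM @Orders AS o
WHERE NOT EXISTS (
    SELECT 1 FROM @Employees AS e
    WHERE e.EmployeeID = o.EmployeeID);
```

When the subquery column is declared NOT NULL, as a primary key would be, this difference cannot arise.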
+2  A: 

Using an INNER JOIN would be faster than a subquery:

SELECT * 
  FROM Orders o
 INNER JOIN Employees e ON o.EmployeeID = e.EmployeeID
 WHERE ShipRegion = 'WA'

Or with specific criteria:

SELECT * 
  FROM Orders o
 INNER JOIN Employees e ON o.EmployeeID = e.EmployeeID
 WHERE ShipRegion = 'WA'
   AND e.EmployeeID = 42
jn29098