What is the most efficient way to write a select statement similar to the one below?

SELECT *
FROM Orders
WHERE Orders.Order_ID not in (Select Order_ID FROM HeldOrders)

The gist is you want the records from one table when the item is not in another table.

+4  A: 

You can use a LEFT OUTER JOIN and check for NULL on the right table.

SELECT O1.*
FROM Orders O1
LEFT OUTER JOIN HeldOrders O2
ON O1.Order_ID = O2.Order_Id
WHERE O2.Order_Id IS NULL
pjp
This is *far* from being a most efficient method.
Quassnoi
This isn't going to necessarily be the most efficient method.
Crappy Coding Guy
It's significantly more efficient than a sub-query, though - at least it only executes against the second table once, instead of once/row.
rwmnau
`@rwmnau`: what gives you an idea that the second query will be executed more than once?
Quassnoi
@rwmnau: Don't you think the optimizer is clever enough to realize that it's an anti join? I wonder if it has ever been like that, at least it hasn't during the last 20 years.
erikkallen
I stand corrected - a little Control-L action has confirmed that for me. I had the idea that anything in the WHERE clause was always executed once for every row, but it seems I'm mistaken. I suppose the best way to determine it is to always check the execution plan.
rwmnau
+15  A: 

For starters, a link to an old article in my blog on how NOT IN predicate works in SQL Server (and in other systems too):


You can rewrite it as follows:

SELECT  *
FROM    Orders o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    HeldOrders ho
        WHERE   ho.OrderID = o.OrderID
        )

, however, most databases will treat these queries the same.

Both these queries will use some kind of an ANTI JOIN.

This is useful for SQL Server if you want to check two or more columns, since SQL Server does not support this syntax:

SELECT  *
FROM    Orders o
WHERE   (col1, col2) NOT IN
        (
        SELECT  col1, col2
        FROM    HeldOrders ho
        )
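For the two-column case, the same check could be written with NOT EXISTS, which SQL Server does accept. This is only a minimal sketch: col1 and col2 are the placeholder column names from the query above and are assumed to exist in both tables.

```sql
-- col1 and col2 are placeholder names, assumed to exist in both tables
SELECT  *
FROM    Orders o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    HeldOrders ho
        WHERE   ho.col1 = o.col1
                AND ho.col2 = o.col2
        )
```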

Note, however, that NOT IN may be tricky due to the way it treats NULL values.

If HeldOrders.OrderID is nullable, no matching record is found, and the subquery returns even a single NULL, the whole query will return nothing (both IN and NOT IN will evaluate to NULL in this case).

Consider these data:

Orders:

OrderID
---
1

HeldOrders:

OrderID
---
2
NULL

This query:

SELECT  *
FROM    Orders o
WHERE   OrderID NOT IN
        (
        SELECT  OrderID
        FROM    HeldOrders ho
        )

will return nothing, which is probably not what you'd expect.

However, this one:

SELECT  *
FROM    Orders o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    HeldOrders ho
        WHERE   ho.OrderID = o.OrderID
        )

will return the row with OrderID = 1.
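To try this yourself, the sample data above can be reproduced with a small script. This is a sketch under assumptions: the column types (INT) are not stated in the original and are chosen here only for illustration. The last query shows one possible workaround if you want to keep NOT IN, which is to exclude NULLs inside the subquery.

```sql
-- Hypothetical setup reproducing the sample data; INT types are an assumption
CREATE TABLE Orders (OrderID INT NOT NULL);
CREATE TABLE HeldOrders (OrderID INT NULL);

INSERT INTO Orders (OrderID) VALUES (1);
INSERT INTO HeldOrders (OrderID) VALUES (2);
INSERT INTO HeldOrders (OrderID) VALUES (NULL);

-- 1 NOT IN (2, NULL) means (1 <> 2) AND (1 <> NULL),
-- which is TRUE AND NULL, i.e. NULL, so the row is filtered out
SELECT  *
FROM    Orders o
WHERE   o.OrderID NOT IN
        (
        SELECT  ho.OrderID
        FROM    HeldOrders ho
        );

-- Excluding NULLs in the subquery makes NOT IN return OrderID = 1
SELECT  *
FROM    Orders o
WHERE   o.OrderID NOT IN
        (
        SELECT  ho.OrderID
        FROM    HeldOrders ho
        WHERE   ho.OrderID IS NOT NULL
        );
```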

Note that the LEFT JOIN solution proposed by others is far from being the most efficient one.

This query:

SELECT  *
FROM    Orders o
LEFT JOIN
        HeldOrders ho
ON      ho.OrderID = o.OrderID
WHERE   ho.OrderID IS NULL

will use a filter condition that has to evaluate and filter out all matching rows, which can be numerous.

An ANTI JOIN method used by both IN and EXISTS only needs to make sure, once per row in Orders, that a matching record does not exist, so it will eliminate all possible duplicates first:

  • NESTED LOOPS ANTI JOIN and MERGE ANTI JOIN will just skip the duplicates when evaluating HeldOrders.
  • A HASH ANTI JOIN will eliminate duplicates when building the hash table.
Quassnoi
First time I've seen a correlated subquery that actually needed to be a correlated subquery that I could grok in less than 5 minutes. Wish I'd known this trick *years* ago.
Philip Kelley
`@Philip Kelley`: which trick exactly?
Quassnoi
What do you mean in this section: "This is useful for SQL Server if you want to check two or more columns, since SQL Server does not support this syntax:". Are you saying this doesn't apply to SQL Server? Are you missing a "not"?
Stimy
`@Stimy`: sure, I missed a `NOT`. `SQL Server`, unlike `Oracle`, `MySQL` and `PostgreSQL`, does not support more than one column in an `IN` / `NOT IN` predicate.
Quassnoi
Using a NOT EXISTS on a correlated subquery, instead of doing it via an outer join like I've always done (particularly for multiple-column lookups).
Philip Kelley
+1  A: 

I'm not sure what is the most efficient, but other options are:

  1. Use EXISTS

    SELECT * FROM ORDERS O WHERE NOT EXISTS (SELECT 1 FROM HeldOrders HO WHERE O.Order_ID = HO.Order_ID)

  2. Use EXCEPT

    SELECT O.Order_ID FROM ORDERS O EXCEPT SELECT HO.Order_ID FROM HeldOrders HO
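
Note that the EXCEPT form returns only the distinct Order_ID values rather than the full rows. If the whole Orders row is needed, one possible approach is to wrap the EXCEPT in a subquery. This is just a sketch and assumes a database that supports EXCEPT (for example SQL Server 2005 or later, or PostgreSQL):

```sql
-- Sketch: use EXCEPT to find the unheld IDs, then fetch the full rows
SELECT O.*
FROM Orders O
WHERE O.Order_ID IN
      (
      SELECT Order_ID FROM Orders
      EXCEPT
      SELECT Order_ID FROM HeldOrders
      )
```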

Alex Black
A: 

Try

SELECT *
FROM Orders
LEFT JOIN HeldOrders
ON HeldOrders.Order_ID = Orders.Order_ID
WHERE HeldOrders.Order_ID IS NULL
Jeff Hornby
+4  A: 

"Most efficient" is going to be different depending on table sizes, indexes, and so on. In other words, it's going to differ depending on the specific case you're using.

There are three ways I commonly use to accomplish what you want, depending on the situation.

1. Your example works fine if Orders.order_id is indexed, and HeldOrders is fairly small.

2. Another method is the "correlated subquery" which is a slight variation of what you have...

SELECT *
FROM Orders o
WHERE o.Order_ID not in (Select Order_ID
                         FROM HeldOrders h
                         where h.order_id = o.order_id)

Note the addition of the where clause. This tends to work better when HeldOrders has a large number of rows. Order_ID needs to be indexed in both tables.

3. Another method I use sometimes is left outer join...

SELECT *
FROM Orders o
left outer join HeldOrders h on h.order_id = o.order_id
where h.order_id is null

When using the left outer join, h.order_id will have a value in it matching o.order_id when there is a matching row. If there isn't a matching row, h.order_id will be NULL. By checking for the NULL values in the where clause you can filter on everything that doesn't have a match.

Each of these variations can work more or less efficiently in various scenarios.
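
One way to compare the variations on your own data is to run them side by side and look at the reported I/O and timing. The sketch below assumes SQL Server, where SET STATISTICS IO and SET STATISTICS TIME print per-query figures to the Messages output; other databases have their own equivalents (such as EXPLAIN).

```sql
-- Assumes SQL Server; figures appear in the Messages output
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant A: NOT EXISTS with a correlated subquery
SELECT o.*
FROM Orders o
WHERE NOT EXISTS (SELECT 1
                  FROM HeldOrders h
                  WHERE h.order_id = o.order_id);

-- Variant B: left outer join with a NULL check
SELECT o.*
FROM Orders o
LEFT OUTER JOIN HeldOrders h ON h.order_id = o.order_id
WHERE h.order_id IS NULL;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Comparing logical reads and elapsed time for each variant, together with the execution plan, gives a more reliable answer than any general rule.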

Crappy Coding Guy
`@Dave`: why do you use `NOT IN` instead of `NOT EXISTS` in method `2`?
Quassnoi
@Quassnoi: Honestly, probably a bad habit. After reading your answer above I plan to start using NOT EXISTS.
Crappy Coding Guy
Option 3 worked best in my scenario (SQL Server 2000, given my tables' indexes). I think the best answer is to test a number of methods.
Stimy