views:

191

answers:

5

I'm having problems figuring out a query for this scenario. I have two tables I want to use in this query, they are like this:

Units
  ID
  Other Data
People
  ID
  UnitID  <-- fk to Units
  Other Data

This is what I want to do:
I want to select all the units that do NOT have a row in the People table linked to them. How can I do this?

+3  A: 
SELECT * FROM Units  
WHERE NOT EXISTS  
(SELECT * FROM People WHERE UnitID = Units.ID)
n8wrl
+8  A: 
SELECT Units.* FROM Units LEFT JOIN People ON People.UnitID = Units.ID WHERE People.ID IS NULL

or alternatively

SELECT Units.* FROM Units WHERE NOT EXISTS (SELECT 1 FROM People WHERE People.UnitID = Units.ID)

or even

SELECT Units.* FROM Units WHERE Units.ID NOT IN (SELECT UnitID FROM People)
araqnid
which one is fastest?
Malfist
That can vary, depending on how big your tables are, and how big they are relative to each other, and which database version you have. I suggest you try all three in query analyzer. For small datasets there won't be any appreciable difference, and in fact the first two are often recognised by the DB as meaning the exact same thing.
araqnid
The People table will always be larger, but not by much. Neither of them will have more than 5,000 records.
Malfist
I would probably even stick the DISTINCT keyword into your third answer. ...NOT IN (SELECT DISTINCT UnitID FROM People). I like the third one as it's obviously shorter than the other two (no horizontal scrolling involved :))
Peter Perháč
It will not have duplicates in it. Adding distinct in this case would actually increase computation time. It it does have more than one person assigned to a unit there's been a major mistake somewhere.
Malfist
The third statement (on my DB at least) is faster for small sets of People, but gets much much worse when the data sets get larger, as it builds a hash table in a temp file. Given that the plan includes a hash, I suspect it's implicitly treating it as DISTINCT anyway.Of the other two, the "WHERE NOT EXISTS" syntax seems to produce slightly better results. All three are under 100ms though, so there isn't much to choose from. For larger sets (I created 2000 units and 1,000,000 people) the "EXISTS" pattern was significantly better. Always check the query plan, don't guess :)
araqnid
Your first and third queries are problematic.Query 1: The LEFT JOIN you listed doesn't have a DISTINCT in it and so it very likely will not return the same results as the semi-join.Query 3: WHERE IN(SELECT UnitID FROM People) is syntactically equivalent to:WHERE UnitID = <value> OR UnitID = <value> OR UnitID = <value> ...You can see this by examining execution plans of IN (1, 2, 3) vs. = 1 OR = 2 OR = 3.For very small sets of data using a list of OR can be best. But for anything larger, a JOIN will be far superior.
Emtucifor
@Emtucifor: I think both your objections are wrong.The left join filters result rows for which there is no matching row in People- hence the cardinality of the result is the same as Units. This is the same behaviour as the second query. Can you provide some example data to show how they are different?The third query... your point about x IN (1,2,3) being equivalent to "(x=1 OR x=2 OR x=3)" is misplaced. The values from People are not known at the time the query is written- writing ORs is not an option. It means the same as the WHERE clause, modulo null values in People.UnitID.
araqnid
I concur with your analysis araqnid. The third statement, the IN can be problematic if the UnitId column of the People table is ever null (I don't see a NOT NULL constraint defined, so I don't assume there is one enforced, because if not, sooner or later someone is going to insert a NULL and this query is going to stop returning results.) Adding an IS NOT NULL predicate in the subquery insulates me from that production problem popping up at the most inopportune time, and its a test case that a normal developer is going to run. I normally steer developers towards the EXISTS predicate.
spencer7593
@araqnid: My mistake. Since you're only returning the unmatched rows, no DISTINCT is necessary. IN() syntax does expand to OR, though my point was not that an explicit OR is preferred (which is impossible anyway as you pointed out) but that it's sloppy: 1) It can't be guaranteed the query will be converted to a JOIN, even though SQL 2005 and up do this usually/often, 2) the moment there's a compound key or the join uses two columns, this syntax completely blows up, and 3) Using IN() instead of JOINs is a bad mental habit that often reveals a core conceptual problem in the query writer.
Emtucifor
@araqnid: Upon more reflection, I wasn't completely wrong. Assuming UnitID is unique in the table, no DISTINCT is necessary. But this isn't always the case. Plus, if it wasn't unique, then the same problem would exist for the other queries as well, so I guess you're right that mentioning the DISTINCT problem wasn't necessary here.
Emtucifor
+2  A: 
SELECT * FROM Units
WHERE ID NOT IN (SELECT UnitID FROM People)
John Saunders
This is a bad habit. Use a join, not a long OR statement (IN expands to OR)
Emtucifor
Thanks, I did not know this. How does "IN (SELECT...)" translate into a series of "OR" operations?
John Saunders
Well, in this simple a query, the engine probably converts it to a JOIN. Check execution plan to find out for sure. But I was focusing on the bad-habit aspect of it. If it's going to do a JOIN then use a JOIN. IN () is best for cases where you know the list will never be more than a few. If you wouldn't be happy seeing it as a list of OR statements then don't write it this way.Here's another aspect to prove it's a bad habit. What do you do if you want another JOIN? Start nesting INs? That is a quick way to get on dailywtf.com
Emtucifor
+1  A: 

You can use a not in (select...)

SELECT ID
FROM Units
WHERE ID NOT IN (SELECT UnitID FROM People);
Jason T
FWIW, there's no benefit to using DISTINCT inside that subquery.
Bill Karwin
You are correct, I ran some execution plans and there is no benefit - even with large sets of data. Edited my answer and learned something ;) I was under the understanding that this would impact the speed of execution.
Jason T
+1  A: 
SELECT U.*
FROM Units U
WHERE NOT EXISTS (
   SELECT 1
   FROM People P
   WHERE U.ID = P.UnitID
)

Please note that this is called an (anti) semi-join. It is an actual join and is not a correlated subquery.

Another method commonly used is:

SELECT U.*
FROM
   Units U
   LEFT JOIN People P ON U.ID = P.UnitID
WHERE
   P.UnitID IS NULL

Note that additional criteria on the join (say you wanted to only join to people who were Active) need to be in the join clause. It won't work to say WHERE P.UnitID IS NULL AND P.Active = 1.

In my experience, each of the different queries can prove to be the performance winner depending on the actual execution plan chosen,. The way the engine uses statistics to predict row counts can make it choose sub-optimal execution plans for some queries, even when statistics are properly updated.

Note: using "SELECT 1" in your semi-joins instead of "SELECT *" will save some cycles during query compilation, since the * is actually expanded out to the column list, then later dropped.

Emtucifor