Meta-answer (comment on the two previous answers):
Using IN tends to degrade to something very like an OR (a disjunction) of all terms in the IN. Bad performance.
Doing a left join and looking for null is an improvement, but it's obscurantist. If we can say what we mean, let's say it in a wau that's clossest to how we'd say it in natural language:
select f.name
from family f left join genus g on f.id = g.family_id
WHERE NOT EXISTS (select * from species c where c.id = g.id);
We want where something doesn't exist, so if we can say "where not exists" all the better. And, the select *
in the subquery doesn't mean it's really bringing back a whole row, so it's not an "optimization" to replace select *
with select 1
, at least not on any modern RDBMS.
Further, where a family has many genera (and in biology, most families do), we're going to get one row per (family, genus) when all we care about is the family. So let's get one row per family:
select DISTINCT f.name
from family f left join genus g on f.id = g.family_id
WHERE NOT EXISTS (select * from species c where c.id = g.id);
This is still not optimal. Why? Well it fulfills the OP's requirement, in that it finds "empty" genera, but it fails to find families that have no genera, "empty" families. Can we make it do that too?
select f.name
from family f
WHERE NOT EXISTS (
select * from genus g
join species c on c.id = g.id
where g.id = f.id);
We can even get rid of the distinct, because we're not joining family to anything. And that is an optimization.
Comment from OP:
That was a very lucid explanation. However, I'm curious as to why using IN or disjunctions is bad for performance. Can you elaborate on that or point me to a resource where I can learn more about the relative performance cost of different DB operations?
Think of it this way. Say that there was not IN operator in SQL. How would you fake an IN?
By a series of ORs:
where foo in (1, 2, 3)
is equivalent to
where ( foo = 1 ) or ( foo = 2 ) or (foo = 3 )
Ok, you say, but that still doesn't tell me why it's bad. It's bad because there's often no decent way to use a key or index to look this up. So what you get is either a) a table scan, where for each disjunction (or'd predicate or element of an IN list), the row gets tested, until a test is true or the list is exhausted. Or b) you get a table scan for each of these disjunctions. The second case (b) may actually be better, which is why you sometimes see a select with an OR turned into one select for each leg of the OR union'd together:
select * from table where x = 1 or x = 3 ;
select * from table where x = 1
union select * from table where x = 3 ;
Now this is not to say you can never use an OR or an IN list. And in some cases, the query optimizer is smart enough to turn an IN list into a join -- and the other answers you were given are precisely the cases where that's most likely.
But if we can explicitly turn our query into a join, well, we don't have to wonder if the query optimizer is smart. And in general, joins are what the databse is best at doing.