tags:

views:

129

answers:

4

These are tables I have:

Class
- id
- name

Order
- id
- name
- class_id (FK)

Family
- id
- order_id (FK)
- name

Genus
- id
- family_id (FK)
- name

Species
- id
- genus_id (FK)
- name

I'm trying to make a query to get a list of Class, Order, and Family names that does not have any Species under them. You can see that the table has some form of hierarchy from Order all the way down to Species. Each table has Foreign Key (FK) that relates to the immediate table above itself on the hierarchy.

Trying to get this at work, but I am not doing so well. Any help would be appreciated!

+3  A: 

Well, just giving this a quick and dirty shot, I'd write something like this. I spend most of my time using Firebird so the MySQL syntax may be a little different, but the idea should be clear

select f.name
from   family f left join genus g on f.id = g.family_id
       left join species s on g.id = species.genus_id
where  ( s.id is null )

if you want to enforce there being a genus then you just remove the "left" portion of the join from family to genus.

I hope I'm not misunderstanding the question and thus, leading you down the wrong path. Good luck!

edit: Actually, re-reading this I think this will just catch families where there's no species within a genus. You could add a " and ( g.id is null )" too, I think.

MattC
+1  A: 

Sub-select to the rescue...


select f.name from family as f, genus as g
where
  f.id == g.family_id and
  g.id not in (select genus_id from species);
Jeffrey Hulten
+1  A: 
SELECT f.name
FROM   family f
WHERE  NOT EXISTS (
       SELECT  1
       FROM    genus g 
       JOIN    species s
       ON      g.id = s.genus_id
       WHERE   g.family_id = f.id
       )

Note, than unlike pure LEFT JOIN solutions, this is more efficient.

It does not select ALL rows filtering out those with NOT NULL values, but instead selects at most one row from genus and species.

Quassnoi
Were you and I separated at birth, Quassnoi?
tpdi
+5  A: 

Meta-answer (comment on the two previous answers):

Using IN tends to degrade to something very like an OR (a disjunction) of all terms in the IN. Bad performance.

Doing a left join and looking for null is an improvement, but it's obscurantist. If we can say what we mean, let's say it in a wau that's clossest to how we'd say it in natural language:

select f.name
from   family f left join genus g on f.id = g.family_id
       WHERE NOT EXISTS (select * from species c where c.id = g.id);

We want where something doesn't exist, so if we can say "where not exists" all the better. And, the select * in the subquery doesn't mean it's really bringing back a whole row, so it's not an "optimization" to replace select * with select 1, at least not on any modern RDBMS.

Further, where a family has many genera (and in biology, most families do), we're going to get one row per (family, genus) when all we care about is the family. So let's get one row per family:

select DISTINCT f.name
from   family f left join genus g on f.id = g.family_id
       WHERE NOT EXISTS (select * from species c where c.id = g.id);

This is still not optimal. Why? Well it fulfills the OP's requirement, in that it finds "empty" genera, but it fails to find families that have no genera, "empty" families. Can we make it do that too?

select f.name
from   family f 
       WHERE NOT EXISTS (
       select * from genus g 
       join species c on c.id = g.id 
       where g.id = f.id);

We can even get rid of the distinct, because we're not joining family to anything. And that is an optimization.

Comment from OP:

That was a very lucid explanation. However, I'm curious as to why using IN or disjunctions is bad for performance. Can you elaborate on that or point me to a resource where I can learn more about the relative performance cost of different DB operations?

Think of it this way. Say that there was not IN operator in SQL. How would you fake an IN?

By a series of ORs:

where foo in (1, 2, 3)

is equivalent to

where ( foo = 1 ) or ( foo = 2 ) or (foo = 3 )

Ok, you say, but that still doesn't tell me why it's bad. It's bad because there's often no decent way to use a key or index to look this up. So what you get is either a) a table scan, where for each disjunction (or'd predicate or element of an IN list), the row gets tested, until a test is true or the list is exhausted. Or b) you get a table scan for each of these disjunctions. The second case (b) may actually be better, which is why you sometimes see a select with an OR turned into one select for each leg of the OR union'd together:

 select * from table where x = 1 or x = 3 ;

 select * from table where x = 1 
 union select * from table where x = 3 ;

Now this is not to say you can never use an OR or an IN list. And in some cases, the query optimizer is smart enough to turn an IN list into a join -- and the other answers you were given are precisely the cases where that's most likely.

But if we can explicitly turn our query into a join, well, we don't have to wonder if the query optimizer is smart. And in general, joins are what the databse is best at doing.

tpdi
That was a very lucid explanation. However, I'm curious as to why using IN or disjunctions is bad for performance. Can you elaborate on that or point me to a resource where I can learn more about the relative performance cost of different DB operations?
Calvin