views: 818
answers: 2
Imagine the following tables:

create table boxes (id int, name text, ...);

create table thingsinboxes (id int, box_id int, thing enum('apple','banana','orange'));

And the tables look like:

Boxes:
id | name
1  | orangesOnly
2  | orangesOnly2
3  | orangesBananas
4  | misc

thingsinboxes:
id | box_id | thing
1  |  1     | orange
2  |  1     | orange
3  |  2     | orange
4  |  3     | orange
5  |  3     | banana
6  |  4     | orange
7  |  4     | apple
8  |  4     | banana

How do I select the boxes that contain at least one orange and nothing that isn't an orange?

How does this scale, assuming I have several hundred thousand boxes and possibly a million things in boxes?

I'd like to keep this all in SQL if possible, rather than post-processing the result set with a script.

I'm using both postgres and mysql, so subqueries are probably bad, given that mysql doesn't optimize subqueries (pre version 6, anyway).
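
For reference, a runnable version of the sample setup (a sketch in MySQL syntax; Postgres has no inline enum, so the thing column would need CREATE TYPE ... AS ENUM or a CHECK constraint, and the extra columns hinted at by ... are omitted):

-- sample schema and data matching the tables above
create table boxes (id int, name text);
create table thingsinboxes (id int, box_id int, thing enum('apple','banana','orange'));

insert into boxes values (1, 'orangesOnly'), (2, 'orangesOnly2'), (3, 'orangesBananas'), (4, 'misc');
insert into thingsinboxes values
  (1, 1, 'orange'), (2, 1, 'orange'),
  (3, 2, 'orange'),
  (4, 3, 'orange'), (5, 3, 'banana'),
  (6, 4, 'orange'), (7, 4, 'apple'), (8, 4, 'banana');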

+4  A: 
SELECT b.*
FROM boxes b JOIN thingsinboxes t ON (b.id = t.box_id)
GROUP BY b.id
HAVING COUNT(DISTINCT t.thing) = 1 AND SUM(t.thing = 'orange') > 0;
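
Note that SUM(t.thing = 'orange') relies on MySQL treating the boolean comparison as 0/1, so it won't run as-is on Postgres. A portable sketch of the same idea (not part of the original answer), which lists the grouped columns explicitly and uses CASE instead of the bare comparison:

SELECT b.id, b.name
FROM boxes b JOIN thingsinboxes t ON (b.id = t.box_id)
GROUP BY b.id, b.name
HAVING COUNT(DISTINCT t.thing) = 1
   AND SUM(CASE WHEN t.thing = 'orange' THEN 1 ELSE 0 END) > 0;

On the sample data above, either form should return boxes 1 (orangesOnly) and 2 (orangesOnly2).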


Here's another solution that does not use GROUP BY:

SELECT DISTINCT b.*
FROM boxes b
  JOIN thingsinboxes t1 
    ON (b.id = t1.box_id AND t1.thing = 'orange')
  LEFT OUTER JOIN thingsinboxes t2 
    ON (b.id = t2.box_id AND t2.thing != 'orange')
WHERE t2.box_id IS NULL;

As always, before you draw conclusions about the scalability or performance of a query, you have to try it with a realistic data set and measure the performance.
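
For instance, checking the plan on each engine is just a prefix on the query (a sketch; EXPLAIN ANALYZE is the Postgres form that actually executes the query and reports timings, while MySQL takes plain EXPLAIN for the estimated plan):

-- Postgres: real execution plan with timings
EXPLAIN ANALYZE
SELECT DISTINCT b.*
FROM boxes b
  JOIN thingsinboxes t1
    ON (b.id = t1.box_id AND t1.thing = 'orange')
  LEFT OUTER JOIN thingsinboxes t2
    ON (b.id = t2.box_id AND t2.thing != 'orange')
WHERE t2.box_id IS NULL;

-- MySQL: prefix the same query with plain EXPLAIN for the estimated plan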

Bill Karwin
Because HAVING is run after everything else, this query builds a giant temporary table and then runs filters on it. In the scalability scenario above, this is an extremely expensive approach. Surely there's something more efficient?
Sam
Sam: I doubt the query optimiser will construct a big temporary table -- since it knows it needs to GROUP BY b.id, it can generate one row at a time in b.id order, and keep track of the number of distinct things and the number of oranges in the last span of rows having identical b.id.
j_random_hacker
@Sam: You constrained the solution by saying you didn't want to use subqueries.
Bill Karwin
The second query performs vastly better (by orders of magnitude) than the first, at least in my situation, probably because of the sheer size of the tables. It's also reasonably fast when the search is a regex instead of =. It may be equivalent to a subquery, but mysql doesn't choke on it.
Sam
+2  A: 

I think Bill Karwin's query is just fine; however, if a relatively small proportion of boxes contain oranges, you should be able to speed things up by using an index on the thing field:

SELECT b.*
FROM boxes b JOIN thingsinboxes t1 ON (b.id = t1.box_id)
WHERE t1.thing = 'orange'
AND NOT EXISTS (
    SELECT 1
    FROM thingsinboxes t2
    WHERE t2.box_id = b.id
    AND t2.thing <> 'orange'
)
GROUP BY t1.box_id;

The WHERE NOT EXISTS subquery will only be run once per orange thing, so it's not too expensive provided there aren't many oranges.
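
The index mentioned above might look something like this (a sketch; the names are placeholders, and the second index is an extra suggestion to serve the correlated NOT EXISTS lookup, which probes by box_id rather than by thing):

create index idx_things_thing on thingsinboxes (thing);
create index idx_things_box_thing on thingsinboxes (box_id, thing);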

j_random_hacker
This one hit the spot for me.
thethinman