ansaurus

Question

Filter a one-to-many query by requiring that all of the "many" rows meet the criteria.

Answer 1

+4 A:

SELECT b.*
FROM boxes b JOIN thingsinboxes t ON (b.id = t.box_id)
GROUP BY b.id
HAVING COUNT(DISTINCT t.thing) = 1 AND SUM(t.thing = 'orange') > 0;

Here's another solution that does not use GROUP BY:

SELECT DISTINCT b.*
FROM boxes b
  JOIN thingsinboxes t1 
    ON (b.id = t1.box_id AND t1.thing = 'orange')
  LEFT OUTER JOIN thingsinboxes t2 
    ON (b.id = t2.box_id AND t2.thing != 'orange')
WHERE t2.box_id IS NULL;
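
To make both queries concrete, here is a minimal sketch of a schema and some sample rows. The table and column names are taken from the queries above; the column types, the NOT NULL constraints, and the sample data are assumptions made purely for illustration.

```sql
-- Hypothetical schema: names come from the queries above, types are assumed.
CREATE TABLE boxes (
  id INT PRIMARY KEY
);

CREATE TABLE thingsinboxes (
  box_id INT NOT NULL,
  thing  VARCHAR(50) NOT NULL,  -- assumed NOT NULL; NULL things would need extra handling
  FOREIGN KEY (box_id) REFERENCES boxes (id)
);

INSERT INTO boxes (id) VALUES (1), (2), (3), (4);

INSERT INTO thingsinboxes (box_id, thing) VALUES
  (1, 'orange'), (1, 'orange'),  -- only oranges: should match
  (2, 'orange'), (2, 'apple'),   -- mixed contents: should not match
  (3, 'apple');                  -- no oranges at all: should not match
-- Box 4 has no rows in thingsinboxes, so the inner join excludes it.
```

With this data, both queries should return only box 1.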

As always, before you draw conclusions about the scalability or performance of a query, you have to try it with a realistic data set and measure the performance.

Bill Karwin 2009-01-26 22:19:54

Because HAVING is run after everything else, this query builds a giant temporary table and then runs filters on it. In the scalability scenario above, this is an extremely expensive approach. Surely there's something more efficient?

Sam 2009-01-26 22:27:26

Sam: I doubt the query optimiser will construct a big temporary table -- since it knows it needs to GROUP BY b.id, it can generate one row at a time in b.id order, and keep track of the number of distinct things and the number of oranges in the last span of rows having identical b.id.
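
One way to settle whether a temporary table is actually built is to ask the server for its plan. A sketch, assuming MySQL: prefix the GROUP BY query with EXPLAIN and look at the Extra column, where MySQL reports "Using temporary" if it needs a temporary table and "Using filesort" if it needs an extra sorting pass.

```sql
-- Show the execution plan instead of running the query.
-- Check the Extra column for "Using temporary" / "Using filesort".
EXPLAIN
SELECT b.*
FROM boxes b JOIN thingsinboxes t ON (b.id = t.box_id)
GROUP BY b.id
HAVING COUNT(DISTINCT t.thing) = 1 AND SUM(t.thing = 'orange') > 0;
```

The plan depends on the available indexes, the table sizes, and the server version, so it is worth checking against the real tables rather than a toy data set.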

j_random_hacker 2009-01-26 22:44:58

@Sam: You constrained the solution by saying you didn't want to use subqueries.

Bill Karwin 2009-01-26 22:51:22

The second query performs vastly better (by orders of magnitude) than the first, at least in my situation, probably because of the sheer size of the tables. It's also reasonably fast when the search is a regex instead of =. It may be equivalent to a subquery, but MySQL doesn't choke on it.
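
For reference, a regex variant of the second query might look like the sketch below. It assumes MySQL's REGEXP operator, and the pattern '^orange' is a hypothetical example, not the one Sam actually used.

```sql
-- Same anti-join shape as the second query, with = and != replaced by
-- REGEXP and NOT REGEXP. The pattern '^orange' is only an example.
SELECT DISTINCT b.*
FROM boxes b
  JOIN thingsinboxes t1
    ON (b.id = t1.box_id AND t1.thing REGEXP '^orange')
  LEFT OUTER JOIN thingsinboxes t2
    ON (b.id = t2.box_id AND t2.thing NOT REGEXP '^orange')
WHERE t2.box_id IS NULL;
```

Note that a regex match generally cannot use an index on thing the way an equality comparison can, so the relative timings may differ from the = version.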

Sam 2009-01-27 00:27:03

Answer 2

+2 A:

I think Bill Karwin's query is just fine. However, if a relatively small proportion of boxes contain oranges, you should be able to speed things up by using an index on the thing field:

SELECT b.*
FROM boxes b JOIN thingsinboxes t1 ON (b.id = t1.box_id)
WHERE t1.thing = 'orange'
AND NOT EXISTS (
    SELECT 1
    FROM thingsinboxes t2
    WHERE t2.box_id = b.id
    AND t2.thing <> 'orange'
)
GROUP BY t1.box_id;

The WHERE NOT EXISTS subquery will only be run once per orange thing, so it's not too expensive provided there aren't many oranges.
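
The index itself could be created along these lines. This is a sketch: both index names are hypothetical, and the second, composite index is an added assumption (not something the answer asks for) that may help the NOT EXISTS probe look up rows by box_id.

```sql
-- Index on the thing column, as suggested above, so the
-- WHERE t1.thing = 'orange' filter can avoid a full table scan.
CREATE INDEX idx_thingsinboxes_thing ON thingsinboxes (thing);

-- Assumed extra: a composite index so the NOT EXISTS subquery can
-- check "any non-orange thing in this box" from the index alone.
CREATE INDEX idx_thingsinboxes_box_thing ON thingsinboxes (box_id, thing);
```

Whether the optimizer actually uses either index depends on the data distribution, so it is worth confirming with EXPLAIN on real data.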

j_random_hacker 2009-01-26 22:45:05

This one hit the spot for me.

thethinman 2009-12-11 21:50:56
