views:

55

answers:

2

I have two models, Project and Category, which have a many-to-many relationship between them. The Project model is very simple:

class Project < ActiveRecord::Base
  has_and_belongs_to_many :categories

  scope :in_categories, lambda { |categories|
    joins(:categories).
    where("categories.id in (?)", categories.collect(&:to_i))
  }
end

The :in_categories scope takes an array of Category IDs (as strings), so using this scope I can get back every project that belongs to at least one of the categories passed in.

But what I'm actually trying to do is filter (a better name would be :has_categories). I want to just get the projects that belong to all of the categories passed in. So if I pass in ["1", "3", "4"] I only want to get the projects that belong to all of the categories.

+1  A: 

There are two common solutions in SQL to do what you're describing.

Self-join:

SELECT ...
FROM Projects p
JOIN Categories c1 ON c1.project_id = p.id
JOIN Categories c3 ON c3.project_id = p.id
JOIN Categories c4 ON c4.project_id = p.id
WHERE (c1.id, c3.id, c4.id) = (1, 3, 4);

Note I'm using syntax to compare tuples. This is equivalent to:

WHERE c1.id = 1 AND c3.id = 3 AND c4.id = 4;

In general, the self-join solution has very good performance if you have a covering index. Probably Categories.(project_id,id) would be the right index, but analyze the SQL with EXPLAIN to be sure.

The disadvantage of this method is that you need four joins if you're searching for projects that match four different categories. Five joins for five categories, etc.

Group-by:

SELECT ...
FROM Projects p
JOIN Categories cc ON c.project_id = p.id
WHERE c.id IN (1, 3, 4)
GROUP BY p.id
HAVING COUNT(*) = 3;

If you're using MySQL (I assume you are), most GROUP BY queries invoke a temp table and this kills performance.

I'll leave it as an exercise for you to adapt one of these SQL solutions to equivalent Rails ActiveRecord API.

Bill Karwin
Huh, I would never have guessed that was the solution. Thanks!
Brandon Weiss
+1  A: 

It seems like in ActiveRecord you would do it like so:

scope :has_categories, lambda { |categories|
  joins(:categories).
  where("categories.id in (?)", categories.collect(&:to_i)).
  group("projects.id HAVING COUNT(projects.id) = #{categories.count}")
}
Brandon Weiss
but doesn't this fail if there are two projects in category A, one in B and none in C? Or is that handled by a data constraint?
Geoff Lanotte
I don't think so . . . take for instance the [1, 3, 4] example. Without the group query, that scope will get any project that is associated with any of those 3 categories. But by adding on the group query, you're only getting the projects that belong to 3 categories (or however many are in the array passed in). Which necessarily means that you've gotten the projects that belong to all of them at once. Does that make sense? I'm not sure if that's the right answer, but it logically makes sense to me, and all my tests pass, so it seems to work properly.
Brandon Weiss