views:

359

answers:

2

Let's assume three models, standard joins:

class Mailbox < ActiveRecord::Base
  has_many :addresses
  has_many :domains, :through => :addresses
end

class Address < ActiveRecord::Base
  belongs_to :mailbox
  belongs_to :domain
end

class Domain < ActiveRecord::Base
  has_many :addresses
  has_many :mailboxes, :through => :addresses
end

Now obviously, if for any given mailbox you want to know in which domains it has addresses, you have two possible ways to go:

m = Mailbox.first
# either: SELECT DISTINCT domains.id, domains.name FROM "domains" INNER JOIN 
#         "addresses" ON "domains".id = "addresses".domain_id WHERE 
#         (("addresses".mailbox_id = 1))
m.domains.all(:select => 'DISTINCT domains.id, domains.name')
# or: SELECT domains.id, domains.name FROM "domains" INNER JOIN "addresses" ON
#     "domains".id = "addresses".domain_id WHERE (("addresses".mailbox_id = 1))
#      GROUP BY domains.id, domains.name
m.domains.all(:select => 'domains.id, domains.name', 
  :group => 'domains.id, domains.name')

The problem for me is that I don't know which solution is better. When I don't specify any other conditions, the PostgreSQL query planner favours solution number two (working as expected), but if I add conditions to the queries, it comes down to "Unique" vs. "Group":

With "DISTINCT":

 Unique  (cost=16.56..16.57 rows=1 width=150)
   ->  Sort  (cost=16.56..16.56 rows=1 width=150)
         Sort Key: domains.name, domains.id
         ->  Nested Loop  (cost=0.00..16.55 rows=1 width=150)
               ->  Index Scan using index_addresses_on_mailbox_id on addresses  (cost=0.00..8.27 rows=1 width=4)
                     Index Cond: (mailbox_id = 1)
               ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=150)
                     Index Cond: (domains.id = addresses.domain_id)
                     Filter: (domains.active AND domains.selfmgmt)
(9 rows)

With "GROUP BY":

Group  (cost=16.56..16.57 rows=1 width=150)
   ->  Sort  (cost=16.56..16.56 rows=1 width=150)
         Sort Key: domains.name, domains.id
         ->  Nested Loop  (cost=0.00..16.55 rows=1 width=150)
               ->  Index Scan using index_addresses_on_mailbox_id on addresses  (cost=0.00..8.27 rows=1 width=4)
                     Index Cond: (mailbox_id = 1)
               ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=150)
                     Index Cond: (domains.id = addresses.domain_id)
                     Filter: (domains.active AND domains.selfmgmt)
(9 rows)

I'm really unsure how to determine the better way to retrieve these data. My instincts tell me to go with "GROUP BY", but I was unable to find any documentation specific enough to solve this problem.

Should I use ":group" or ":select => 'DISTINCT'"? Is that choice the same with other modern RDBMS like e.g. Oracle, DB2 or MySQL (I don't have access to those, so I can't perform tests)?

+1  A: 

SQL has to do perform the same steps to gather the required information to prepare for a GROUP BY or for the 'equivalent' DISTINCT query: same filters, same sorting etc. The difference is in the final step, which here Postgre calls 'Group' or 'Unique'.

When a GROUP BY query does not include any aggregate in its SELECT, 'Group' and 'Unique' are essentially the same thing (even if they boil down to distinct methods in the code of the server). In the presence of some aggregate such as say COUNT(*), MAX(some_field) etc. the 'Group' step is a bit more resource intensive as it needs extra storage to store the tally, the min etc. and it needs to increment or compare with it etc.

I would go with the GROUP BY approach, as it may would to add aggregates in the SELECT, if such info was desired at some time, without changing the query much. Even if the actual count was not displayed it may be nice for the application to show the domains in descending order of this count.

mjv
the difference is in implementation, and in PostgreSQL < 8.4 group by is usually much faster.
depesz
+5  A: 

If you're using Postgresql < 8.4 (which I guess you are, given the plans) - it's usually better to use GROUP BY instead of DISTINCT as its plan is simply more efficient.

In 8.4 there is no difference as DISTINCT was "taught" to be able to use group operators as well.

depesz
Thanks for your answer. However, is that also true for other database systems like e.g. DB2? I've edited the question to emphasize that I'm not only interested in pgsql specific hints.
cite
Oracle (at least that I'm aware since 9i) operates in the same manner - there's no difference in an explain plan when using GROUP BY or DISTINCT. I'd have to look at asktom for the article supporting that.
OMG Ponies