views:

226

answers:

5

I'm trying to make sense of the right way to use JOIN, COUNT(*), and GROUP BY to do a pretty simple query. I've actually gotten it to work (see below) but from what I've read, I'm using an extra GROUP BY that I shouldn't be.

(Note: The problem below isn't my actual problem (which deals with more complicated tables), but I've tried to come up with an analogous problem)

I have two tables:

Table: Person
-------------
key  name     cityKey
1    Alice    1
2    Bob      2
3    Charles  2
4    David    1

Table: City
-------------
key  name
1    Albany
2    Berkeley
3    Chico

I'd like to do a query on the People (with some WHERE clause) that returns

  • the number of matching people in each city
  • the key for the city
  • the name of the city.

If I do

SELECT COUNT(Person.key) AS count, City.key AS cityKey, City.name AS cityName
FROM Person 
LEFT JOIN City ON Person.cityKey = City.key 
GROUP BY Person.cityKey, City.name

I get the result that I want

count   cityKey   cityName
2       1         Albany
2       2         Berkeley

However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.

So what's the right way to do this? I've been trying to google for an answer, but I feel like there's something fundamental that I'm just not getting.

A: 

Your query would only work on MySQL, because you group on Person.cityKey but select city.key. All other databases would require you to use an aggregate like min(city.key), or to add City.key to the group by clause.

Because the combination of city name and city key is unique, the following are equivalent:

select    count(person.key), min(city.key), min(city.name)
...
group by  person.citykey

Or:

select    count(person.key), city.key, city.name
...
group by  person.citykey, city.key, city.name

Or:

select    count(person.key), city.key, max(city.name)
...
group by  city.key

All rows in the group will have the same city name and key, so it doesn't matter if you use the max or min aggregate.

P.S. If you'd like to count only different persons, even if they have multiple rows, try:

count(DISTINCT person.key)

instead of

count(person.key)
Andomar
I think that may be a typo in the question; I believe that he intended his "group by" to include both the non-aggregated columns in his "SELECT" clause. I could be wrong however.
Pointy
For your second example, you have a bunch of things in the `group by` part; from the page I linked in my post, they were saying that that was the wrong thing to do. However, the `MIN` part works; (I thought that only worked for `int` type columns.)
Craig S
@Craig S: If I read the post you linked to correctly, they're just saying you should `group by` on columns you select. My second example does that correctly. The post also refers to the "annoying error" you get if you do it incorrectly. MySQL is the only database that would accept the "incorrect" way without an "annoying error" :)
Andomar
+1  A: 

I don't think that it's "wrong" in this case, because you've got a one-to-one relationship between city name and city key. You could rewrite it such that you join to a sub-select to get the count of persons to cities by key, to the city table again for the name, but it's debatable that that'd be better. It's a matter of style and opinion I guess.

select PC.ct, City.key, City.name
  from City
  join (select count(Person.key) ct, cityKey key from Person group by cityKey) PC
    on City.key = PC.key

if my SQL isn't too rusty :-)

Pointy
A: 

However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.

It's not wrong. You have to understand how the Query Optimizer sees your query. The order in which it is parsed is what requires you to "throw the last part in." The optimizer sees your query in something akin to this order:

  • the required tables are joined
  • the composite dataset is filtered through the WHERE clause
  • the remaining rows are chopped into groups by the GROUP BY clause, and aggregated
  • they are then filtered again, through the HAVING clause
  • finally operated on, by SELECT / ORDER BY, UPDATE or DELETE.

The point here is that it's not that the GROUP BY has to name all the columns in the SELECT, but in fact it is the opposite - the SELECT cannot include any columns not already in the GROUP BY.

ahsteele
Except on MySQL :)
Andomar
Is it better to use an aggregate function than to add something to the group by?
Craig S
@Andomar very true but I thought we were talking about SQL in general here ;)
ahsteele
@Craig that depends on your need. If you don't need to perform a function on the data then no you shouldn't be putting it into a function and should be grouping on it. The only reason you get the either an aggregate function or the GROUP BY clause error is because the optimizer needs to know what you are doing and it has to be either perform a function on this column or group by it. You shouldn't use a function just to avoid the group by.
ahsteele
+1  A: 

...I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.

You misunderstand, you got it backwards.
Standard SQL requires you to specify in the GROUP BY all the columns mentioned in the SELECT that are not wrapped in aggregate functions. If you don't want certain columns in the GROUP BY, wrap them in aggregate functions. Depending on the database, you could use the analytic/windowing function OVER...

However, MySQL and SQLite provide the "feature" where you can omit these columns from the group by - which leads to no end of "why doesn't this port from MySQL to fill_in_the_blank database?!" Stackoverflow and numerous other sites & forums.

OMG Ponies
Is adding `City.name` to the `Group By` "better" (i.e. faster, more portable, etc) than using `Min(city.name)` as suggested by Andomar?
Craig S
@Craig S: Depends on your needs. But using an aggregate function for the sake of not having to do it in the GROUP BY is risky -- it'd require more testing to ensure things are correct. I don't see the point of using an aggregate solely for the sake of not having to define it in the GROUP BY.
OMG Ponies
@OMG Ponies: Okay; I guess after reading that article, I was worried that I was just "adding the group by to make it work" and that in doing so I was making a big SQL mistake. I'm operating on a relatively huge database (if it were the tables in my example, there would be millions of people, and hundreds of cities)
Craig S