tags:

views:

1021

answers:

8

This has bugged me for a long time.

99% of the time, the GROUP BY clause is an exact copy of the SELECT clause, minus the aggregate functions (MAX, SUM, etc.).
This breaks the Don't Repeat Yourself principle.

Is there a good reason for this?

edit

I realise that some implementations allow you to have different fields in the GROUP BY than in the SELECT (hence 99%, not 100%), but surely that's a very minor exception?
Can someone explain what is supposed to be returned if you use different fields?

Thanks.

+1  A: 

Actually, wouldn't that be 100% of the time? Is there a case in which you can have a (non-aggregate) column in the select that is not in the GROUP BY?

I don't have an answer though. It certainly does seem like a awkward moment for the language.

MarkB
MySQL allows this...
gms8994
+6  A: 

Because they are two different things, you can group by items that aren't in the select clause

EDIT:

Also, is it safe to make that assumption?

I have a SQL statement

Select ClientName, InvAmt, Sum(PayAmt) as PayTot

Is it "correct" for the server to assume I want to group by ClientName AND InvoiceAmount? I personally prefer (and think it's safer) to have this code

Select ClientName, InvAmt, Sum(PayAmt) as PayTot
Group By ClientName

throw an error, prompting me to change the code to

Select ClientName, Sum(InvAmt) as InvTot, Sum(PayAmt) as PayTot
Group By ClientName
Binary Worrier
That's true, but why require the columns in the SELECT list if they're mandatory? `SELECT a, MAX(c) FROM t GROUP BY b` could imply grouping by a, couldn't it? I think it may just be a clarity issue.
Mike Woodhouse
Implied operations are the bane of debugging and testing.
Robert C. Barth
+3  A: 

I hope/expect we'll see something more comprehensive soon; a SQL history lesson on the subject would be useful and informative. Anyone? Anyone? Bueller?

In the meantime, I can observe the following:

SQL predates the DRY principle, at least as far as it it was documented in The Pragmatic Programmer.

Not all DBs require the full list: Sybase, for example, will happily execute queries like

SELECT a, b, COUNT(*)
FROM some_table
GROUP BY a

... which (at least every time I accidentally ran such a monster) often leads to such enormous inadvertent recordsets that panic-stricken requests quickly ensue, begging the DBAs to bounce the server. The result is a sort of partial Cartesian product, but I think it may mostly be a failure on Sybase's part to implement the SQL standard properly.

Mike Woodhouse
+3  A: 

I tend to agree with you - this is one of many cases where SQL should have slightly smarter defaults to save us all some typing. For example, imagine if this were legal:

Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By *

where "*" meant "all the non-aggregate fields". If everybody knew that's how it worked, then there would be no confusion. You could sub in a specific list of fields if you wanted to do something tricky, but the splat means "all of 'em" (which in this context means, all the possible ones).

Granted, "*" means something different here than in the SELECT clause, so maybe a different character would work better:

Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By !

There are a few other areas like that where SQL just isn't as eloquent as it could be. But at this point, it's probably too entrenched to make many big changes like that.

Ian Varley
yes. thank you. that sort of thing sounds sensible. I guess the answer is "there isn't really an answer". ;)
AJ
Difficult though to select a meaningful character or keyword. The * or "ALL" keyword isn't semantically correct, as you're not grouping by ALL (you're grouping by all EXCEPT . . .). It's not a bad idea, can't see how it would work without being a semantic kludge though.
Binary Worrier
Why not just GROUP instead of GROUP BY <some wildcard>?
Martijn
Sure, that would work too, but I suspect that might be less than intuitive for current SQL programmers. For that matter, the syntax could also be to leave it off entirely, and the parser would know that you're supposed to automatically group by any non-aggregate fields. That's probably how it should have been done in the first place, IMO, but maybe that's getting into "trying to be so helpful that you end up being confusing" territory.
Ian Varley
A: 

The good reason for it is that you would get incorrect results more often than not if you did not specify all columns. Suppose you have three columns, col1, col2 and col3. Suppose your data looks like this:

col1 Col2 Col3 a b 1 a c 1 b b 2 a b 3

select cola, colb, sum(col3) from mytable group by cola, colb would give the following results

col1 Col2 Col3 a b 4 a c 1 b b 2 How would it interpret select cola, colb, sum(col3) from mytable group by cola

my guess would be

col1 Col2 Col3 a b 5 a c 5 b b 2

These are clearly bad results. Of course the more complex the query and the more joins the less likely it would be that the query would return correct results or that the programmer would even know if they were incorrect. Personally I'm glad that grouop by requires the fields.

HLGEM
+1  A: 

I share the op's view that repeating is a bit annoying, especially if the non-aggregate fields contain elaborate statements like ifs and funtions and a whole lot of other things. It would be nice if there could be some shorthand in the group by clause - at least a column alias. Referring to the columns by number may be another option, albeit one that probably has their own problems.

IronGoofy
nice point about long expressions. it does add to the pain.
AJ
+2  A: 

Perhaps we need a shorthand form - call it GroupSelect

GroupSelect Field1, Field2, sum(Field3) From SomeTable Where (X = "3")

This way, the parser need only throw an error if you leave out an aggregate function.

Peter LaComb Jr.
Hmmm. Nice idea.
AJ
A: 

The short answer is SQL does have a shortcut for grouping by all the select columns. It is

SELECT DISTINCT ... FROM ...

fredt
And if you want to include any aggregate functions?
Martin Smith