views:

3094

answers:

17

I learned something simple about SQL the other day:

SELECT c FROM myTbl GROUP BY C

Has the same result as:

SELECT DISTINCT C FROM myTble

What I am curious of, is there anything different in the way an SQL engine processes the command, or are they truly the same thing?

I personally prefer the distinct syntax, but I am sure it's more out of habit than anything else.

EDIT: This is not a question about aggregates. The use of GROUP BY with aggregate functions is understood.

+2  A: 

They have different semantics, even if they happen to have equivalent results on your particular data.

Hank Gay
+26  A: 

"group by" lets you use aggregate functions, like avg, max, min, sum, and count. "distinct" just removes duplicates.

For example, if you have a bunch of purchase records, and you want to know how much was spent by each department, you might do something like:

select department, sum(amount) from purchases group by department

This will give you one row per department, containing the department name and the sum of all of the "amount" values in all rows for that department.

Glomek
The use of GROUP BY I understand, The question is based on the fact that it returns a distinct dataset when no aggregate function is present.
Brettski
Because GROUP BY implicitly does a DISTINCT over the values of the column you're grouping by (sorry for the cacophony).
Joe Pineda
Thank you Joe, that was actually a good description why. Not written that way in explanations of GROUP BY
Brettski
+7  A: 

Use DISTINCT if you just want to remove duplicates. Use GROUPY BY if you want to apply aggregate operators (MAX, SUM, GROUP_CONCAT, ..., or a HAVING clause).

jkramer
+7  A: 

For the query you posted, they are identical. But for other queries that may not be true.

For example, it's not the same as:

SELECT C FROM myTbl GROUP BY C, D
Joel Coehoorn
+2  A: 

If you use DISTINCT with multiple columns, the result set won't be grouped as it will with GROUP BY, and you can't use aggregate functions with DISTINCT.

Bill the Lizard
+2  A: 

In that particular query there is no difference. But, of course, if you add any aggregate columns then you'll have to use group by.

Jeffrey L Whitledge
+3  A: 

group by is used in aggregate operations -- like when you want to get a count of Bs broken down by column C

select C, count(B) from myTbl group by C

distinct is what it sounds like -- you get unique rows.

In sql server 2005, it looks like the query optimizer is able to optimize away the difference in the simplistic examples I ran. Dunno if you can count on that in all situations, though.

Danimal
+1  A: 

You're only noticing that because you are selecting a single column.

Try selecting two fields and see what happens.

Group By is intended to be used like this:

SELECT name, SUM(transaction) FROM myTbl GROUP BY name

Which would show the sum of all transactions for each person.

chris
This is not a question of aggregates. In your example, SELECT c, d FROM mytbl GROUP BY C, D; will in fact return the same data set as SELECT DISTINCT C, D FROM mytbl; This is the fundamentals of the question
Brettski
+2  A: 

GROUP BY has a very specific meaning that is distinct (heh) from the DISTINCT function.

GROUP BY causes the query results to be grouped using the chosen expression, aggregate functions can then be applied, and these will act on each group, rather than the entire resultset.

Here's an example that might help:

Given a table that looks like this:

name
------
barry
dave
bill
dave
dave
barry
john

This query:

SELECT name, count(*) AS count FROM table GROUP BY name;

Will produce output like this:

name    count
-------------
barry   2
dave    3
bill    1
john    1

Which is obviously very different from using DISTINCT. If you want to group your results, use GROUP BY, if you just want a unique list of a specific column, use DISTINCT. This will give your database a chance to optimise the query for your needs.

Dan
+6  A: 

There is no difference (in SQL Server, at least). Both queries use the same execution plan.

http://www.sqlmag.com/Article/ArticleID/24282/sql_server_24282.html

Maybe there is a difference, if there are sub-queries involved:

http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/

There is no difference (Oracle-style):

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:32961403234212

MusiGenesis
+10  A: 

I expect there is the possibility for subtle differences in their execution. I checked the execution plans for two functionally equivalent queries along these lines in Oracle 10g:

core> select sta from zip group by sta;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    58 |   174 |    44  (19)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |    58 |   174 |    44  (19)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| ZIP  | 42303 |   123K|    38   (6)| 00:00:01 |
---------------------------------------------------------------------------

core> select distinct sta from zip;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    58 |   174 |    44  (19)| 00:00:01 |
|   1 |  HASH UNIQUE       |      |    58 |   174 |    44  (19)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| ZIP  | 42303 |   123K|    38   (6)| 00:00:01 |
---------------------------------------------------------------------------

The middle operation is slightly different: "HASH GROUP BY" vs. "HASH UNIQUE", but the estimated costs etc. are identical. I then executed these with tracing on and the actual operation counts were the same for both (except that the second one didn't have to do any physical reads due to caching).

But I think that because the operation names are different, the execution would follow somewhat different code paths and that opens the possibility of more significant differences.

I think you should prefer the DISTINCT syntax for this purpose. It's not just habit, it more clearly indicates the purpose of the query.

Dave Costa
+14  A: 

MusiGenesis' response is functionally the correct one with regard to your question as stated; the SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."

However,I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotcha's down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.

A hammer can work to drive in a screw sometimes , but if you've got a screwdriver handy, why bother?

Skeolan
Thanks for my band's new name: The Pernicious Gotchas.
MusiGenesis
I am in complete agreement with you Skeolan. I was quite surprised when I came across this functionality. It isn't something I plan to use, but a way things have been done at this new place I am working at.
Brettski
+1  A: 

Please don't use GROUP BY when you mean DISTINCT, even if they happen to work the same. I'm assuming you're trying to shave off milliseconds from queries, and I have to point out that developer time is orders of magnitude more expensive than computer time.

Andy Lester
A: 

The way I always understood it is that using distinct is the same as grouping by every field you selected in the order you selected them.

i.e:

select distinct a, b, c from table;

is the same as:

select a, b, c from table group by a, b, c
Zenshai
Agreed, but would it be same as select c,b,a from table group by a,b,c
Dheer
+1  A: 

From a 'SQL the language' perspective the two constructs are equivalent and which one you choose is one of those 'lifestyle' choices we all have to make. I think there is a good case for DISTINCT being more explicit (and therefore is more considerate to the person who will inherit your code etc) but that doesn't mean the GROUP BY construct is an invalid choice.

I think this 'GROUP BY is for aggregates' is the wrong emphasis. Folk should be aware that the set function (MAX, MIN, COUNT, etc) can be omitted so that they can understand the coder's intent when it is.

The ideal optimizer will recognize equivalent SQL constructs and will always pick the ideal plan accordingly. For your real life SQL engine of choice, you must test :)

PS note the position of the DISTINCT keyword in the select clause may produce different results e.g. contrast:

SELECT COUNT(DISTINCT C) FROM myTbl;

SELECT DISTINCT COUNT(C) FROM myTbl;
onedaywhen
A: 
SELECT COUNT(DISTINCT C) FROM myTbl;

Incorrect syntax (MS SQL 2005).

I don't know, but this gives me a syntax error...

What is the error it gives? works fine on my installation.
Brettski
A: 

HI..

I have the table with the following data


empid   empname deptid   address
aa76    John  6      34567
aa75    rob      4      23456
aa71    smith  3      12345
aa74    dave  2       12345
a77  blake   2       12345
aa73    andrew  3      12345
aa90    sam   1      12345
aa72    will  6      34567
aa70    rahul  5      34567

I ve used the following queries:

select
deptid
,EMPID
,EMPNAME
,ADDRESS
from 
mytable
group by 1,2,3,4

Which gives the result:


deptid  empid  empname  address
1     aa90   sam   12345
2     aa74   dave    12345
2     aa77   blake    12345
3     aa71   smith    12345
3     aa73   andrew    12345
4     aa75    rob    23456
5     aa70   rahul    34567
6     aa76    John    34567
6     aa72    will    34567

and for the query:


select
distinct (deptid)
,EMPID
,EMPNAME
,ADDRESS
from 
mytable

the result set is:

deptid empid empname address   
1     aa90 sam     12345
2     aa74 dave 12345
2     aa77 blake 12345
3     aa71 smith 12345
3     aa73 andrew 12345
4     aa75  rob 23456
5     aa70 rahul 34567
6     aa72 will 34567
6     aa76  John 34567

In the second query though i've given DISTINCT for DEPTID..how come i got the duplicate DEPTID...

Could you explain this..

identifymecnu
That could depend on the SQL engine you are using. Where did you run this query? You could of asked this as a new question too.
Brettski
I've run this query on Teradate..Any way thank you very much for your suggestion
identifymecnu
That's because distinct is for the whole record not just one field in most databases. How would it know which of the two to use for the values for the other columns if not? If you want only one column to be distinct, you must use aggreaget functions and explicitly tell the datbase which of the two related records you want.
HLGEM