views:

115

answers:

7

I want to know which of the 2 queries below is faster :-

Select s.*,
       sm.* 
  from tblStudent s
Inner Join (SELECT studentId,SUM(marks) As Sum
              from tblStudentsMarks  
           Group By studentId) as sm on s.StudentID = sm.StudentID;

...or:

Select s.studentId, 
       s.Name,
       SUM(Marks)
  From tblStudent s
Inner Join tblStudentsMarks sm On s.Studentid = sm.StudentId
  Group By s.studentId, s.Name;

EDIT :-

Query Estimation of 1st Query :- http://img28.imageshack.us/img28/166/1stpicd.jpg

Query Estimation of 2nd Query :- http://img245.imageshack.us/img245/5064/2ndpic.jpg

Thanks in advance :)

+4  A: 

You can run both of them and compare their execution plans to see which one performs better.
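For example (a sketch using standard T-SQL session switches, nothing specific to this question), you can also turn on I/O and timing statistics and compare the numbers printed to the Messages tab for each query:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run query 1 here, note logical reads and CPU/elapsed times
-- run query 2 here, note logical reads and CPU/elapsed times

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;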

Sachin Shanbhag
A: 

One reason your second query is better is that you are not using the
* wildcard: selecting only the columns you need reduces the amount of data returned.

Terrance
+1  A: 

Depends on your data and indexes. Generally, SQL Server is said to be better at optimizing joins than subqueries, and since the join is also more readable (and thus more maintainable and less likely to cause bugs), I'd go with option 2 (the join) for now and see if you hit any performance roadblocks. If this is a speed-critical query, I'd try both and compare the results. Make sure you use realistic data for testing, though.

tdammers
@tdammers :- But in the 2nd query I have to group by 2 columns. Won't that affect performance? What if the 2nd query contained 20 columns? I mean, say tblStudent contains 20 columns, so I have to group by all of them. Isn't the 1st option more stable, since it groups on only one column even if you add columns to tblStudent?
Ankit Rathod
You shouldn't have to group on both columns in the second query according to ANSI SQL—assuming `studentId` is the primary key for `tblStudent`. `s.Name` has a ‘functional dependency’ on `s.studentId` so including it in the group clause is redundant (and will likely have no effect on the query plan).
bobince
@bobince :- If I don't group by both columns, I get this :- `Msg 8120, Level 16, State 1, Line 11: Column 'tblStudent.Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.` What should I do next?
Ankit Rathod
Really, SQL Server doesn't abide by the standard? That's highly disappointing. Well... it still shouldn't make a difference to the query plan, as grouping by the primary key is already fully covered by its index. Just more annoying cruft to type out if you need to be compatible with T-SQL, I guess.
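(For the record, a sketch of the T-SQL-friendly workaround, which is essentially the OP's first query: do the aggregation in a derived table keyed only on `studentId`, then join back to pick up `Name`, so nothing but the key ever appears in a GROUP BY. The `TotalMarks` alias is just illustrative.)

Select s.studentId,
       s.Name,
       sm.TotalMarks
  From tblStudent s
 Inner Join (Select studentId, SUM(marks) As TotalMarks
               From tblStudentsMarks
              Group By studentId) as sm
    On s.studentId = sm.studentId;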
bobince
@bobince : I don't know what you're talking about, but even in Oracle 9i we are required to group by all columns that are present in the `Select` column list.
Ankit Rathod
Oracle is known not to support this SQL:1999 requirement, but I'm disappointed if SQL Server doesn't either. See http://www.oreillynet.com/databases/blog/2007/05/debunking_group_by_myths.html under “what does the standard say” for some background.
bobince
A: 

Measure them. Whichever takes less time is faster.
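If you want wall-clock numbers rather than plan estimates, a minimal timing harness looks like this (a sketch; `@t0` and `elapsed_ms` are illustrative names, and `datetime2`/`SYSDATETIME()` assume SQL Server 2008 or later):

DECLARE @t0 datetime2 = SYSDATETIME();

-- ... run the query being measured here ...

SELECT DATEDIFF(millisecond, @t0, SYSDATETIME()) AS elapsed_ms;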

Nathon
Oh my God. Really?
Ankit Rathod
@Nitesh - why didn't you do that then?
Kris C
+2  A: 

After reviewing the execution plans, the first query is more efficient.
Both have a table scan on tblStudentsMarks, but the scan's percentage of total cost is much lower in the first query than in the second. The SORT isn't ideal, but since the table scan appears in both plans, it's probably the easier issue to address by adding an index. Both the SORT and the table scan could be improved by reviewing the indexes...
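For example (a sketch; the index name is made up, and it assumes the aggregate only ever touches studentId and marks), a covering index would let the optimizer read a narrow, pre-sorted index instead of scanning the base table:

CREATE NONCLUSTERED INDEX ix_StudentsMarks_studentId
    ON tblStudentsMarks (studentId)
    INCLUDE (marks);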

OMG Ponies
@OMG Ponies : I expected that answer :). Some were advising me to use column names instead of *, and a few others said that joins are much faster than subqueries. But in fact the 1st query was more efficient.
Ankit Rathod
@Nitesh Panchal: Subqueries aren't the villain they're made out to be, but the only real way to know what is better is to look at the query/execution plan to see how the optimizer deals with the queries. To me, the first query doesn't use a subquery - it's a derived table/inline view, but that's just a terminology issue.
OMG Ponies
@OMG Ponies: I agree with your recommendations, but I'm not sure comparing the percentages is reliable. Since each percentage is relative to that query's own total estimated cost, percentages from two different queries aren't comparable. If the total estimated cost of the first query is many times greater than that of the second, the percentages tell us nothing. We would like to see the estimated costs themselves to do a better analysis. But again, analyzing the indexes is definitely in order.
bobs
@bobs: I hear ya. My main issue is the queries aren't returning identical results... I haven't put much stock in estimated plans, preferring Actual plans...
OMG Ponies
@OMG - If we assume that the table scan of `tblStudentsMarks` has a constant cost of 1 unit, then the fact that it takes 18% of the first query's cost and 51% of the second's makes the first query worse, not better, as that would mean the first query had an overall cost of 5.5 units and the second query a cost of 2 units.
Martin Smith
@Martin Smith: I think you got the numbers backwards; I'd want less dependence on a table scan, not more...
OMG Ponies
@OMG - Actually it is impossible for us to know which is better, as we have no idea of the cardinality. A scan on the inside of a nested loops join isn't good, but I'm pretty sure that the OP must have put a minuscule amount of data in both tables for SQL Server to consider that an option. If the table scan is only being executed once, the figures above apply. By messing around with the cardinality I could reach any conclusion as to which plan was best: http://img18.imageshack.us/img18/3977/executionplan.jpg
Martin Smith
A: 

To speed things up, don't use Select *. You could also try a Left Join instead of an Inner Join (but only if it produces the same result).
Choose the one that is faster on your production server: depending on data volume, index fragmentation, and index selectivity, the query optimizer can choose a different plan on production than on the test server. When you test, don't forget to empty the cache between runs, and test on a realistic amount of data, because the plan can change over time.
In real life (if it is fast enough) I would always go for the second solution, because it is more readable and maintainable.
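One common way to empty the caches between runs on a test server (never on production):

CHECKPOINT;               -- flush dirty pages to disk first
DBCC DROPCLEANBUFFERS;    -- empty the buffer pool, forcing cold-cache reads
DBCC FREEPROCCACHE;       -- discard all cached query plans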

dragos55
A: 

Your example execution plans seem somewhat unrealistic. You have no useful indexes at all on tblStudentsMarks, and I suspect you can't have added many rows at all to either of the two tables. Setting up a possibly more realistic test:

CREATE TABLE tblStudent (
    studentId INT IDENTITY(1,1) PRIMARY KEY,
    Name      varchar(50),
    filler    char(2000)    -- padding so the clustered index rows are wide
);

-- Narrow covering index for (studentId, Name) lookups
CREATE NONCLUSTERED INDEX ix ON tblStudent (studentId, Name);

-- 10,000 students with random names
INSERT INTO tblStudent (Name)
SELECT TOP 10000 NEWID()
FROM sys.objects s1, sys.objects s2, sys.objects s3, sys.objects s4;

CREATE TABLE tblStudentsMarks (
    examid    INT NOT NULL,
    studentId INT NOT NULL FOREIGN KEY REFERENCES tblStudent,
    marks     decimal(5,2) NOT NULL,
    PRIMARY KEY CLUSTERED (studentId, examid)
);

-- 5 random marks for every third student
INSERT INTO tblStudentsMarks
SELECT ABS(CHECKSUM(NEWID())) AS examid,
       studentId,
       ABS(CHECKSUM(NEWID())) / 10000000 AS marks
FROM tblStudent
CROSS JOIN (SELECT TOP 5 1 AS C FROM sys.objects) d
WHERE studentId % 3 = 0;

This setup gives a different conclusion.

Execution Plans

If we get rid of the * in Query 1 and replace it with `Select s.studentId, s.Name, sm.Sum`, both query plans turn out to be exactly the same.
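(For reference, that version of Query 1 would read as follows, in the same style as the original:)

Select s.studentId,
       s.Name,
       sm.Sum
  From tblStudent s
 Inner Join (Select studentId, SUM(marks) As Sum
               From tblStudentsMarks
              Group By studentId) as sm
    On s.studentId = sm.studentId;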

Martin Smith
@Martin Smith :- How do we read this diagram? I just have a basic understanding that there should be no (or a minimal number of) table scans in a query execution plan. What else besides that do we need to watch out for?
Ankit Rathod