views:

93

answers:

4

I have a query that I want to execute that fastest possible.

Here it is:

select d.InvoiceDetailId,a.Fee,a.FeeTax
from InvoiceDetail d
LEFT JOIN InvoiceDetail a on a.AdjustDetailId = d.InvoiceDetailId

I put an ascending index on AdjustDetailId column

I then ran the query with 'Show Actual Execution Plan' and the result estimated subtree cost(off of the topmost select node) was 2.07

I then thought, maybe I can do something to improve this so I added a conditional to the left join like so:

select d.InvoiceDetailId,a.Fee,a.FeeTax
from InvoiceDetail d
LEFT JOIN InvoiceDetail a on a.AdjustDetailId is not null 
and a.AdjustDetailId = d.InvoiceDetailId

I re-ran and I got a subtree cost of .98. So I thought, great I made it twice as fast. Well I then clicked show client statistics and then clicked execute 4-5 times with both queries and believe it or not the first query averaged out to be faster. I don't get it. By the way the query returns 120K rows.

Any insight?

Maybe i get tainted results because of caching, but I don't know if that is the case or how to reset the caching.

EDIT: Okay I googled how to clear query cache so I added the following before the queries:

DBCC DROPCLEANBUFFERS  
DBCC FREEPROCCACHE

I then ran each query 5 times and the first query was still a little faster(13%). 1st Query: Client Processing time: 239.4 2nd Query: Client Processing time: 290

So I guess the question is, why do you think so? Could it be when the table quadruples in size that the second query will be faster? Or the left join is causing the query to hit the index twice so it will always be slower.

Please don't flame me, I'm just trying to get educated.

EDIT # 2: I need to get all the InvoiceDetails, not just the adjusted ones hence the left join.

EDIT # 3: The real problem I'm trying to solve with the query is to sum up all of the InvoiceDetail rows but at the same time adjust them as well. So ultimately it seems that the best query to perform is the following. I thought doing a join then adding the joined in table would be the only way but it seems that grouping by a conditional solves the problem most elegantly.

SELECT CASE WHEN AdjustDetailId IS NULL THEN InvoiceDetailId ELSE AdjustDetailId END AS InvoiceDetailId
  ,SUM(Fee + FeeTax) AS Fee
FROM dbo.InvoiceDetail d
GROUP BY CASE WHEN AdjustDetailId IS NULL THEN InvoiceDetailId ELSE AdjustDetailId END

Example: With the following rows InvoiceDetailId|Fee|FeeTax|AdjustDetailId

1|300|0|NULL

2|-100|0|1

3|-50|0|1

4|250|0|NULL

My desire was to get the following: InvoiceDetailId|Fee 1|150

4|250

Thanks everybody for your input.

+5  A: 

If you want to make that query really fast, you need to

  • turn the LEFT JOIN into an INNER JOIN
  • make sure the InvoiceDetail.AdjustDetailId and InvoiceDetail.InvoiceDetailId are indexed

    SELECT 
      d.InvoiceDetailId, a.Fee, a.FeeTax
    FROM 
      dbo.InvoiceDetail d
    INNER JOIN 
      dbo.InvoiceDetail a ON a.AdjustDetailId = d.InvoiceDetailId
    

Next, you need to make sure your statistics are up to date, so that the cost-based query optimizer can work properly.

In order to update the statistics, use the UPDATE STATISTICS (table) command - see the MSDN docs on UPDATE STATISTICS here

marc_s
The problem with that is that the inner join will remove all of the InvoiceDetail rows with no adjustment. AdjustDetailId is nullable by design. That would probably make me do a inner join with a union where AdjustDetailId is null
Jose
true - but an outer join is always much slower than an INNER JOIN - if you really need it, yes, use a LEFT OUTER JOIN - but the other recommendations still apply
marc_s
how do you make your statistics up to date?
Jose
+2  A: 

I would have guessed that they would be the same, (with the same execution plan) since it is impossible for a predicate like a.AdjustDetailId = d.InvoiceDetailId to be true if one side is null... So adding the Is Not Null condition is redundant. But maybe the processor is executing additional unnecessary steps with that additional predicate in there...

But what the other answer mentions is more important. Do you really need to output all the rows where there is no matching record (Invoices without a Adjusting Invoice) ?? If not change it to an Inner join and it will speed up a lot.

if you really need them, however, You might try a Union

  Select d.InvoiceDetailId,a.Fee,a.FeeTax
  From InvoiceDetail d
     Join InvoiceDetail a 
         On a.AdjustDetailId = d.InvoiceDetailId
  Union
  Select InvoiceDetailId, null, null
  from InvoiceDetail 
  Where AdjustDetailId Is Null

Which does the same thing without using an outer join... (It is problematic as to whether two queries with a union will run faster than the single outer join query... )

Charles Bretana
That's exactly what it seems is my only alternative, I will check it out
Jose
Union can be really slow because it removes duplicates. Union all is much faster.
Lluis Martinez
+1  A: 

You only have 1 table in this query, right?

If you use

select InvoiceDetailId, Fee, FeeTax from InvoiceDetail

That WILL get all the rows, not just the adjusted ones.

Asuming you are doing a self-join, and doing it for a good reason, I would index InvoiceDetailId and AdjustDetailId and see which index(es) the execution plan uses.

You could also try "include" the Fee and FeeTax columns in your index - this will help a lot if the table is really wide.

Brad
+1  A: 

For your queries, I can think of 3 different reasonable execution plans:

LOOP JOIN OUTER [a.AdjustDetailId = d.InvoiceDetailId]
    TABLE SCAN InvoiceDetail d
    TABLE SCAN InvoiceDetail a

HASH JOIN OUTER [a.AdjustDetailId = d.InvoiceDetailId]
    TABLE SCAN InvoiceDetail d
    TABLE SCAN InvoiceDetail a

LOOP JOIN OUTER
    HASH JOIN OUTER [x.AdjustDetailId = d.InvoiceDetailId] AS y
        TABLE SCAN InvoiceDetail d
        INDEX SEEK [InvoiceDetail, AdjustDetailId IS NOT NULL] x
    InvoiceDetail a [a.AdjustDetailId = y.AdjustDetailId]

Perhaps adding the IS NOT NULL condition makes the optimizer choose another one of the plans, it's hard to say.

erikkallen