views:

167

answers:

4

I have executed a query and included the Actual Execution Plan. There is one Hash Match that is of interest to me because it's subtree uses a Index Scan instead of an index seek. When I mouse over this Hash Match there is a section called "Probe Residual". I had assumed that this is whatever values I am joining on. Am I correct here or is there a better explanation of what that means?

The second question I had is regarding the indexes it uses. In my example I am pretty sure this particular join is joining on two columns. The index that it is Scanning has both of these columns in it as well as another column that is not used in the join. I was under the impression that this would result in an Index Seek rather than a Scan. Am I mistaken on this?

+3  A: 

This blog post will probably answer your first question.

As for your second, index scans might be selected by the optimizer in a number of situations. Off the top of my head:

  • If the index is very small
  • If most of the rows in the index will be selected by the query

  • If you are using functions in the where clause of your query

For the first two cases, it's more efficient to do a scan, so the optimizer chooses it over a seek. For the third case, the optimizer has no choice.

womp
Very good article, thank you for posting it. So is he saying that if the first column indexed is not being joined on by my query that it could result in an index scan rather than a seek?
Abe Miessler
Yes. Btw, his blog is really good for learning about the internal workings of sql server.
womp
Yeah, i was impressed with what I read there. Adding him to my list. Thanks for pointing me towards him!
Abe Miessler
+1  A: 

1/ A Hash Match means that it takes a hash of columns used in an equality join, but needs to include all the other columns involved in the join (for >, etc) so that they can be checked too. This is where residual columns come in.

2/ An Index Seek can be done if it can go straight to the rows you want. Perhaps you're applying a calculation to the columns and using that? Then it will use the index as a smaller version of the data, but will still need to check every row (applying the calculation on each one).

Rob Farley
+1  A: 

A Hash Join will generally (always?) use a scan or at least a range scan. A hash join works by scannig both left and righ join tables (or a range in the tables) and building an in-memory hash table that contains all values 'seen' by the scans.

What happened in your case is this: the QO noticed that it can obtain all the values of a column C from a non-clustered index that happens to contain this column (as a key or as an included column). Being a non-clustered index is probably fairly narrow, so the total amount of IO to scan the entire non-clustered index is not exagerate. The QO also considered that the system has enough RAM to store a hash table in memory. When compared the cost of this query (a scan of a non-clustered index end-to-end for, say, 10000 pages) with the cost of a nested loop that used seeks (say 5000 probes at 2-3 pages each) the scan won as requiring less IO. Of course, is largely speculation on my part, but I'm trying to present the case from the QO point of view, and the plan is likely optimal.

Factors that contributed to this particular plan choice would be:

  • a large number of estimated candidates on the right side of the join
  • availability of the join column in a narrow non-clustered index for the left side
  • plenty of RAM

For a large estimate of the number of candidates, a better choice than the hash join is only the merge-join, and that one requires the input to be presorted. If both the left side can offer an access path that guarantees an order on the joined column and the right side has a similar posibility then you may end up with the merge join, which is the fastes join.

Remus Rusanu
A Hash Match doesn't necessarily use a Scan. It can easily involve a Seek to particular records and then use the results of that Seek in the Hash Match. For a Nested Loop, it's handling one record at a time, so is more likely to prefer a Seek, but that doesn't mean a Hash will prefer a scan - it just needs to get all the rows that are potential matches. If you're filtering on both tables involved and have a covering index but also a calculation, you can reproduce this behaviour.
Rob Farley
@Rob: I'm not sold on that. Took me awhile to find a public available ref on it, but read http://blogs.msdn.com/craigfr/archive/2006/08/10/687630.aspx on how Hash-Join works, both the build and the probe phase *read the entire input in one pass* which kind of rules out seeks. Also the pseudo-algorithm clearly states that there is no correlation between left and right side that determines the probe filtering.
Remus Rusanu
Right... let's consider setup first. Create two tables, with two fields each. Index one on the filterfield, including the joinfield column. Next we'll populate them with numbers. create table dbo.table1 (id int identity(1,1) primary key ,joinfield int ,filterfield int );gocreate table dbo.table2 (id int identity(1,1) primary key ,joinfield int ,filterfield int );gocreate index ix1 on dbo.table1(filterfield) include (joinfield);create index ix2 on dbo.table2(filterfield) include (joinfield);go
Rob Farley
Here's the population script. insert dbo.table1 (joinfield, filterfield)select number, numberfrom master..spt_valueswhere type = 'P';insert dbo.table2 (joinfield, filterfield)select number, numberfrom master..spt_valueswhere type = 'P';go/* *Now the key part* I join on the joinfield, but I filter first. The filter causes a seek, even though the join uses a hash. If it doesn't, use INNER HASH JOIN */select *from dbo.table1 t1 inner join dbo.table2 t2 on t1.joinfield = t2.joinfieldwhere t1.filterfield between 100 and 200and t2.filterfield between 100 and 200;
Rob Farley
@Rob: but those are range seeks access, both on left side and right side children of the hash operator. My point is that hash join will not be used with exact (key match) seek, it will always get the input as a scan (be it range scan, ie. seek followed by scan).
Remus Rusanu
Yeah - it's a seek which returns a range, which is then used in the Hash Match. The asker was concerned about an Index Scan in the subtree of a Hash Match, and your answer inferred that a Hash Match required it to be an Index Scan there, when that's not the case. As my first comment said "It can easily involve a Seek to particular records and then use the results of that Seek in the Hash Match."
Rob Farley
+1  A: 

Check out those excellent articles on execution plans on simple-talk.com:

They also have a free e-book SQL Server execution plans for download.

marc_s