views:

108

answers:

2

In general what makes an SQL query optimiser decide between a nested loop and a hash join.

A: 

Typically, it's going to be dependent on the size of the sets that are being joined.

I highly recommend reading "Inside Microsoft SQL Server 2008: T-SQL Querying" by Itzik Ben-Gan:

http://www.solidq.com/insidetsql/books/insidetsql2008/

(the 2005 edition is just as applicable on this topic as well)

He goes into your question, as well as many others when it comes to getting the most out of your queries.

casperOne
+5  A: 

NESTED LOOPS are good if the condition inside the loop is sargable, that is index can be used to limit the number of records.

For a query like this:

SELECT  *
FROM    a
JOIN    b
ON      b.b1 = a.a1
WHERE   a.a2 = @myvar

, with a leading, each record from a will be taken and all corresponding records in b should be found.

If b.b1 is indexed and has high cardinality, then NESTED LOOP will be a preferred way.

In SQL Server, it is also the only way to execute non-equijoins (something other than = condition in the ON clause)

HASH JOIN is the fastest method if all (or almost all) records should be parsed.

It takes all records from b, builds a hash table over them, then takes all records from a and uses the value of the join column as a key to look up the hash table.

  • NESTED LOOPS takes this time:

    Na * (Nb / C) * R,

    where Na and Nb are the numbers of records in a and b, C is the index cardinality, and R is constant time required for the row lookup (1 is all fields in SELECT, WHERE and ORDER BY clauses are covered by the index, about 10 if they are not)

  • HASH JOIN takes this time:

    Na + (Nb * H)

    , where H is sum of constants required to build and lookup the hash table (per record). They are programmed into the engine.

SQL Server computes the cardinality using the table statistics, computes and compares the two values and chooses the best plan.

Quassnoi
This is an excellent answer, I hope to test it as I'm going along.
cindi