What are the advantages, if any, of explicitly specifying a HASH JOIN over a regular JOIN (where SQL Server decides the best JOIN strategy)? For example:

select pd.*
from profiledata pd
inner hash join profiledatavalue val on val.profiledataid=pd.id

In the simplified sample code above, I'm specifying the JOIN strategy, whereas if I leave off the "hash" keyword, SQL Server does a MERGE JOIN behind the scenes (per the "actual execution plan").
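One way to check whether the hint actually helps is to run both versions side by side and compare the reported I/O and timings. This is a minimal sketch using the tables from the question; the results depend entirely on your data and indexes, so treat it as a measurement aid rather than a recommendation.

```sql
-- Print I/O and timing statistics for each statement (Messages tab in SSMS).
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Version 1: the optimizer picks the join algorithm.
SELECT pd.*
FROM profiledata AS pd
INNER JOIN profiledatavalue AS val ON val.profiledataid = pd.id;

-- Version 2: the join hint forces a hash join.
SELECT pd.*
FROM profiledata AS pd
INNER HASH JOIN profiledatavalue AS val ON val.profiledataid = pd.id;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Note that a join hint written in the FROM clause also makes SQL Server enforce the join order as written for the whole query, so the hinted version differs from the unhinted one in more than the join algorithm.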

A: 

The query optimizer doesn't guarantee that it finds the optimal plan: an exact algorithm would be too slow to use on a production server, so greedy algorithms are used instead.

Hence, the rationale behind these hints is to let the user specify the join strategy in cases where the optimizer can't work out which one is really the best.

akappa
+1  A: 

Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

CodeToGlory
+3  A: 

The optimiser does a good enough job for everyday use. However, in an extreme case it could in theory need 3 weeks to find the perfect plan, so it settles for a good plan found quickly, and there is a chance that the generated plan will not be ideal.

I'd leave it alone unless you have a very complex query or huge amounts of data where it simply can't produce a good plan. Then I'd consider it.

But over time, as data changes or grows, indexes change, and so on, your JOIN hint will become obsolete and prevent an optimal plan. A JOIN hint can only optimise that single query for the set of data you have at the time of development.

Personally, I've never specified a JOIN hint in any production code.

I've normally solved a bad join by changing my query around, adding/changing an index or breaking it up (eg load a temp table first). Or my query was just wrong, or I had an implicit data type conversion, or it highlighted a flaw in my schema etc.
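As a rough illustration of the "load a temp table first" approach, the sketch below splits a query into two steps instead of adding a join hint. The `createdon` column and the cutoff date are hypothetical, used only to give the first step a filter; the table names come from the question.

```sql
-- Step 1: materialize the filtered rows first.
-- ASSUMPTION: createdon is a hypothetical column, used only to illustrate a filter.
SELECT pd.id
INTO #recent_profiles
FROM profiledata AS pd
WHERE pd.createdon >= '20090101';

-- Step 2: join the (hopefully much smaller) temp table, with no join hint.
-- The temp table has its own statistics, so the optimizer sees the real row count.
SELECT val.*
FROM #recent_profiles AS rp
INNER JOIN profiledatavalue AS val ON val.profiledataid = rp.id;

DROP TABLE #recent_profiles;
```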

I've seen other developers use them, but only where they had complex views nested upon complex views, and the hints caused problems later when the views were refactored.

Edit:

I had a conversation today where some colleagues are going to use them to force a bad query plan (with NOLOCK and MAXDOP 1) to "encourage" migration away from legacy complex nested views that one of their downstream systems calls directly.

gbn
A: 

The only hint I've ever seen in shipping code was OPTION (FORCE ORDER). A stupid bug in the SQL query optimizer generated a plan that tried to join an unfiltered varchar column to a unique identifier. Adding FORCE ORDER caused it to run the filter first.

I know, overloading columns is bad. Sometimes, you've got to live with it.
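For reference, this is roughly what the hint looks like in a query. The original query is not shown in this answer, so the sketch borrows the tables from the question purely to show the syntax: with FORCE ORDER, SQL Server keeps the join order as written in the FROM clause instead of reordering it.

```sql
-- Syntax sketch only: the tables are from the question, not from the
-- query described in this answer.
SELECT pd.*
FROM profiledata AS pd
INNER JOIN profiledatavalue AS val ON val.profiledataid = pd.id
OPTION (FORCE ORDER);  -- join in the order written: profiledata, then profiledatavalue
```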

Joshua
Edit: I'm about to add an OPTION (MAXDOP 1) to prevent a background worker from chewing up all the processor power.
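A minimal sketch of that hint, again using the question's tables only as a stand-in for the background worker's real query, which is not shown here:

```sql
-- MAXDOP 1 restricts this statement to a serial plan (one scheduler),
-- so it cannot spread across all available cores.
SELECT pd.*
FROM profiledata AS pd
INNER JOIN profiledatavalue AS val ON val.profiledataid = pd.id
OPTION (MAXDOP 1);
```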
Joshua