views:

3530

answers:

2

What are the performance gains/losses between hash joins and merge joins, specifically in Oracle RDBMS?

+4  A: 

Merge join is the best possible as it exploits the ordering, resulting in a single pass down the tables to do the join. IF you have two tables (or covering indexes) that have their ordering the same such as a primary key and an index of a table on that key then a merge join would result if you performed that action.

Hash join is the next best, as it's usually done when one table has a small number (relatively) of items, its effectively creating a temp table with hashes for each row which is then searched continuously to create the join.

Worst case is nested loop which is order (n * m) which means there is no ordering or size to exploit and the join is simply, for each row in table x, search table y for joins to do.

Spence
If one were always better than the other, then the other would never be used, don't you think? I think the difference is more complex than this.
David Aldridge
I'm sorry if you misunderstood me. I was trying to describe the types of joins and why a merge join is best. The problem is that merge join only works if you have sort order to exploit and a hash join is only more efficient when the joined table has a relatively small amount of rows in it. Apologies if that wasn't clear in the answer.
Spence
Nested loops is the worst of course, when nothing else is possible.
Spence
Nested loops isn't always the worst because nested loop joins return their first results very fast (low latency). When you use hint first_rows chances are high that a nested loop join will be used because the hint indicates that you want low latency. Sometimes the user prefers high throughput, (batch process for instance), sometimes the user prefers low latency (in an interactive UI for instance).
Theo
I suppose if the result set is small thats a fair comment.
Spence
+5  A: 

A "sort merge" join is performed by sorting the two data sets to be joined according to the join keys and then merging them together. The merge is very cheap, but the sort can be prohibitively expensive especially if the sort spills to disk. The cost of the sort can be lowered if one of the data sets can be accessed in sorted order via an index, although accessing a high proportion of blocks of a table via an index scan can also be very expensive in comparison to a full table scan.

A hash join is performed by hashing one data set into memory based on join columns and reading the other one and probing the hash table for matches. The hash join is very low cost when the hash table can be held entirely in memory, with the total cost amounting to very little more than the cost of reading the data sets. The cost rises if the hash table has to be spilled to disk in a one-pass sort, and rises considerably for a multipass sort.

You should note that hash joins can only be used for equi-joins, but merge joins are more flexible.

In general, if you are joining large amounts of data in an equi-join then a hash join is going to be a better bet.

This topic is very well covered in the documentation.

http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/optimops.htm#i51523

David Aldridge