tags:

views:

199

answers:

3

I am curious on how exactly LINQ (not LINQ to SQL) is performing is joins behind the scenes in relation to how Sql Server performs joins.

Sql Server before executing a query, generates an Execution Plan. The Execution Plan is basically an Expression Tree on what it believes is the best way to execute the query. Each node provides information on whether to do a Sort, Scan, Select, Join, ect.

On a 'Join' node in our execution plan, we can see three possible algorithms; Hash Join, Merge Join, and Nested Loops Join. Sql Server will choose which algorithm to for each Join operation based on expected number of rows in Inner and Outer tables, what type of join we are doing (some algorithms don't support all types of joins), whether we need data ordered, and probably many other factors.

Join Algorithms:

Nested Loop Join: Best for small inputs, can be optimized with ordered inner table.

Merge Join: Best for medium to large inputs sorted inputs, or an output that needs to be ordered.

Hash Join: Best for medium to large inputs, can be parallelized to scale linearly.

LINQ Query:

DataTable  firstTable, secondTable;

...

var rows = from firstRow in firstTable.AsEnumerable ()
                join secondRow in secondTable.AsEnumerable ()
                    on firstRow.Field<object> (randomObject.Property)
                    equals secondRow.Field<object> (randomObject.Property)
           select new {firstRow, secondRow};

SQL Query:

SELECT *
FROM firstTable fT
    INNER JOIN secondTable sT ON fT.Property = sT.Property

Sql Server might use a Nested Loop Join if it knows there are a small number of rows from each table, a merge join if it knows one of the tables has an index, and Hash join if it knows there are a lot of rows on either table and neither has an index.

Does Linq choose its algorithm for joins? or does it always use one?

A: 

Linq to SQL does not send join hints to the server. Thus the performance of a join using Linq to SQL will be identical to the performance of the same join sent "directly" to the server (i.e. using pure ADO or SQL Server Management Studio) without any hints specified.

Linq to SQL also doesn't allow you to use join hints (as far as I know). So if you want to force a specific type of join, you'll have to do it using a stored procedure or the Execute[Command|Query] method. But unless you specify a join type by writing INNER [HASH|LOOP|MERGE] JOIN, then SQL Server always picks the type of join it thinks will be most efficient - it doesn't matter where the query came from.

Other Linq query providers - such as Entity Framework and NHibernate Linq - will do exactly the same thing as Linq to SQL. None of these have any direct knowledge of how you've indexed your database and so none of them send join hints.

Linq to Objects is a little different - it will (almost?) always perform a "hash join" in SQL Server parlance. That is because it lacks the indexes necessary to do a merge join, and hash joins are usually more efficient than nested loops, unless the number of elements is very small. But determining the number of elements in an IEnumerable<T> might require a full iteration in the first place, so in most cases it's faster just to assume the worst and use a hashing algorithm.

Aaronaught
+1  A: 

LINQ itself does not chose algorithms of any kind, as LINQ, strictly speaking, is simply a way of expressing a query in SQL-like syntax that can map to function calls on either IEnumerable<T> or IQueryable<T>. LINQ is entirely a language feature and does not provide functionality, only another way of expressing existing function calls.

In the case of IQueryable<T>, it's entirely up to the provider (such as LINQ to SQL) to chose the best method of producing the results.

In the case of LINQ to Objects (using IEnumerable<T>), simple enumeration is what's used (roughly equivalent to nested loops) in all cases. There is no deep inspection (or even knowledge of) the underlying data types in order to optimize the query.

Adam Robinson
This actually isn't entirely correct - the Linq to Objects `JoinIterator` uses an internal `Lookup<TKey, TValue>`, which is closer to a hash join. Although for some reason they claim that it's [actually a nested loop in Linq to XML](http://msdn.microsoft.com/en-us/library/bb387080.aspx).
Aaronaught
+1  A: 

The methods on System.Linq.Enumerable are performed in the order they are issued. There is no query optimizer at play.

Many methods are very lazy, which allows you to not fully enumerate the source by putting .First or .Any or .Take at the end of the query. That is the easiest optimization to be had.

For System.Linq.Enumerable.Join specifically, the docs state that this is a hash join.

The default equality comparer, Default, is used to hash and compare keys.

So examples:

//hash join (n+m) Enumerable.Join
from a in theAs
join b in theBs on a.prop equals b.prop

//nestedloop join (n*m)  Enumerable.SelectMany
from a in theAs
from b in theBs
where a.prop == b.prop
David B