views: 528
answers: 4

I have seen that the performance of LINQ to Objects queries can sometimes be improved significantly if they are forced to execute immediately using .ToArray(), but I can't quite understand why. For example, in the sample below, the function Deferred() executes much slower than the function Immediate(), and its running time grows exponentially with the value of limit (perhaps it was exponential in both functions, but the execution time of Immediate() was too small for me to say definitively).

public void Deferred()
{
    var all = Range(limit);
    var even = from e in EvenRange(limit) where all.Contains(e) select e;
    var odd = from o in OddRange(limit) where !even.Contains(o) select o;

    var query = from q in odd select q;

    foreach(var i in query) { var j = i+1; }
}

public void Immediate()
{
    var all = Range(limit);
    var even = (from e in EvenRange(limit) where all.Contains(e) select e)
        .ToArray();
    var odd = (from o in OddRange(limit) where !even.Contains(o) select o)
        .ToArray();

    var query = (from q in odd select q).ToArray();

    foreach(var i in query) { var j = i+1; }
}

public static IEnumerable<int> OddRange(int stop)
{
    for (int i = 1; i < stop; i+=2) yield return i;
}

public static IEnumerable<int> EvenRange(int stop)
{
    for (int i = 2; i < stop; i+=2) yield return i;
}

public static IEnumerable<int> Range(int stop)
{
    for (int i = 0; i < stop; ++i) yield return i;
}

What causes this slowdown? Is it just the extra code that the compiler generates and has to run every time I access the enumerable, or does Deferred() also go into some extra loops, changing the "Big O" complexity of the query?

+6  A: 

As far as I know, IEnumerable doesn't cache its results, so your clause involving even.Contains (which must check all elements of even) forces a complete re-enumeration of even, which accounts for the performance behavior you've noticed. Transforming even into an array or a List should give you good performance when enumerating odd. Other .NET languages (such as F#) include functions to cache the results of an enumeration (if you're interested, you may look at F#'s Seq.cache function).
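The re-enumeration described above is easy to observe with a small counting iterator. This is a hypothetical sketch (the class, method names, and counter are not from the original post); it counts how many times a deferred sequence is walked from the start when Contains is called in a loop, versus a ToArray()-materialized copy:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ReEnumerationDemo
{
    static int starts;

    // Increments the counter each time the sequence is enumerated from the start.
    public static IEnumerable<int> Evens(int stop)
    {
        starts++;
        for (int i = 2; i < stop; i += 2) yield return i;
    }

    public static (int deferred, int cached) Run()
    {
        starts = 0;
        IEnumerable<int> even = Evens(10);   // deferred: nothing enumerated yet

        // Deferred: every Contains call walks the sequence from the beginning again.
        for (int o = 1; o < 10; o += 2) even.Contains(o);
        int deferredStarts = starts;         // one full pass per Contains call: 5

        starts = 0;
        var cached = Evens(10).ToArray();    // single pass, results materialized
        for (int o = 1; o < 10; o += 2) cached.Contains(o);
        int cachedStarts = starts;           // 1: the array is scanned, not the iterator

        return (deferredStarts, cachedStarts);
    }

    static void Main()
    {
        var (d, c) = Run();
        Console.WriteLine($"deferred starts: {d}, cached starts: {c}");
    }
}
```

Since none of the odd values match, each Contains call on the deferred sequence enumerates it to the end, so the iterator restarts five times; the array version pays for a single pass up front.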

emaster70
How would a re-enumeration of 'even' be different from traversing the 'even' array in the other function? Is it just the query re-execution that's slower? The big-O complexity is, perhaps, still the same.
sandesh247
The complexity is higher when re-enumerating "even" because each enumeration of "even" requires:
- enumerating EvenRange(limit)
- for each of its elements, enumerating "all" until that element is found (or until there are no more elements)
- repeating all of the above for every element of "odd"

compared to a simple sequential O(n) scan of an array, which only compares an element against a static, precomputed list. Basically, the big-O is much higher in the first case because you have to reconstruct several sequences for each element of "odd".
emaster70
+1  A: 

In addition to what emaster70 said, when values are read from an array, an entire cache line is fetched into the CPU cache, so scanning the arrays goes much faster than if each value had to be fetched from main memory.

Morten Christiansen
+1  A: 

In this code

var all = Range(limit);    
var even = from e in EvenRange(limit) where all.Contains(e) select e;    
var odd = from o in OddRange(limit) where !even.Contains(o) select o;

since IEnumerables constructed from iterator blocks or LINQ queries are lazy, computing each element of 'odd' requires recomputing the entire 'even' sequence. As a result, the big-O of this code is awful. It may be instructive to use a very small 'limit' and put a Console.WriteLine() inside the loop in Range(), and watch the behavior.
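The instrumentation suggested above can be sketched like this (hypothetical code, not from the original answer; limit is hard-coded to a small value, and a counter stands in for the Console.WriteLine so the restarts can be tallied):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredTrace
{
    const int limit = 8;
    static int rangeStarts;   // how many times Range was restarted

    public static IEnumerable<int> Range(int stop)
    {
        rangeStarts++;        // put a Console.WriteLine here to watch each restart
        for (int i = 0; i < stop; ++i) yield return i;
    }

    static IEnumerable<int> EvenRange(int stop)
    {
        for (int i = 2; i < stop; i += 2) yield return i;
    }

    static IEnumerable<int> OddRange(int stop)
    {
        for (int i = 1; i < stop; i += 2) yield return i;
    }

    public static int Run()
    {
        rangeStarts = 0;
        var all = Range(limit);
        var even = from e in EvenRange(limit) where all.Contains(e) select e;
        var odd = from o in OddRange(limit) where !even.Contains(o) select o;

        // Each of the 4 odd elements re-runs the whole 'even' query, and each of
        // the 3 even elements restarts 'all' once: 4 * 3 = 12 restarts.
        foreach (var i in odd) { }
        return rangeStarts;
    }

    static void Main() => Console.WriteLine(Run()); // prints 12
}
```

With limit = 8 the innermost sequence is restarted 12 times for a single pass over odd, which is exactly the recomputation that changes the big-O.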

Brian
How would this be different from traversing the entire 'even' array, in the other function? Has the big-O changed, or is it just the query re-execution slowing things down?
sandesh247
The big-O changed, because it must _recompute_ 'even' (which requires re-running all the calls to all.Contains()) - that is the main difference.
Brian
+1  A: 

As emaster70 explained, the Deferred version recalculates the even collection for every call to even.Contains. That makes it degrade very quickly for higher values of limit.

The Immediate version is better, and if you also make the all collection immediate, like this:

var all = Range(limit).ToArray();

it gets about three times faster.

However, the Contains method for an array is an O(n) operation, so that doesn't scale well either. If the condition is equality, you should use a join instead of looking up each value using Contains.

If you need to make the lookup, you should use a HashSet instead of an array, as the Contains method for that is an O(1) operation.

For a limit of 10000, this is 100 times faster than the Immediate version:

public void Joined()
{
    var all = Range(limit);
    var even = from e in EvenRange(limit) join a in all on e equals a select e;
    var evenSet = new HashSet<int>(even);
    var odd = from o in OddRange(limit) where !evenSet.Contains(o) select o;

    var query = from q in odd select q;

    foreach (var i in query) { var j = i + 1; }
}
Guffa