views:

305

answers:

3

Using reflector I have noticed that System.Linq.Enumerable.Count method has a condition in it to optimize it for the case when the IEnumerable<T> passed is in fact an ICollection<T>. If the cast succeeds the Count method does not need to iterate over every element, but can call the Count method of ICollection.

Based on this I was starting to think that IEnumerable<T> can be used like a readonly view of a collection, without having the performance loss that I originally expected based on the API of IEnumerable<T>

I was interested whether the optimization of the Count still holds when the IEnumerable<T> is a result of a Select statement over an ICollection, but based on reflected code this case is not optimized, and requires an iteration through all elements.

Do you draw the same conclusions from reflector? What could be the reason behind the lack of this optimization? I seems like there is a lot of time wasted in this common operation. Does the spec require that the each element is evaluated even if the Count can be determined without doing that?

A: 

An ICollection knows the number of items (Count) it contains. It doesn't have to iterate any items to determine it. Take for example the HashSet class (which implements ICollection).

An IEnumerable<T> doesn't know how many items it contains. You have to enumerate the whole list to determine the number of items (Count).

Wrapping the ICollection in a LINQ statement, doesn't make it more efficient. No matter how you twist and turn, the ICollection will have to be enumerated.

Zyphrax
Using reflector you can look at the implementaion of the method Enumerable.Count. You will see that it attemps to cast to ICollection, and if that succeeds it calls the Count on the collection, so it doesn't need to iterate over it. The same would be possible with the iterator object that Select returns.
shojtsy
+7  A: 

It doesn't really matter that the result of Select is lazily evaluated. The Count is always equivalent to the count of the original collection so it could have certainly been retrieved directly by returning a specific object from Select that could be used to short-circuit evaluation of the Count method.

The reason it's not possible to optimize out evaluation of the Count() method on the return value of a Select call from something with determined count (like a List<T>) is that it could change the meaning of the program.

The selector function passed to Select method is allowed to have side effects and its side effects are required to happen deterministically, in a predetermined order.

Assume:

new[]{1,2,3}.Select(i => { Console.WriteLine(i); return 0; }).Count();

The documentation requires this code to print

1
2
3

Even though the count is really known from the start and could be optimized, optimization would change the behavior of the program. That's why you can't avoid enumeration of the collection anyway. That's exactly one of the reasons why compiler optimizations are much easier in pure functional languages.


UPDATE: Apparently, it's not clear that it's perfectly possible to implement Select and Count so that Selects on ICollection<T> will still be lazily evaluated but the Count() will be evaluated in O(1) without enumerating the collection. I'm going to do that without changing the interface of any methods. A similar thing is already done for ICollection<T>:

private interface IDirectlyCountable {
    int Count {get;}
}
private class SelectICollectionIterator<TSource,TResult> : IEnumerable<T>, IDirectlyCountable {
     ICollection<TSource> sequence;
     Func<TSource,TResult> selector;
     public SelectICollectionIterator(ICollection<TSource> source, Func<TSource,TResult> selector) {
         this.sequence = source;
         this.selector = selector;
     }
     public int Count { get { return sequence.Count; } }
     // ... GetEnumerator ... 
}
public static IEnumerable<TResult> Select<TSource,TResult>(this IEnumerable<TSource> source, Func<TSource,TResult> selector) {
    // ... error handling omitted for brevity ...
    if (source is ICollection<TSource>)
       return new SelectICollectionIterator<TSource,TResult>((ICollection<TSource>)source, selector);
    // ... rest of the method ...
}
public static int Count<T>(this IEnumerable<T> source) {
    // ...
    ICollection<T> collection = source as ICollection<T>;
    if (collection != null) return collection.Count;
    IDirectlyCountable countableSequence = source as IDirectlyCountable;
    if (countableSequence != null) return countableSequence.Count;
    // ... enumerate and count the sequence ...
}

This will still evaluate the Count lazily. If you change the underlying collection, the count will get changed and the sequence is not cached. The only difference will be not doing the side effects in the selector delegate.

Mehrdad Afshari
How about the indexer property of the ICollection having a side effect? Is it not a concern that the optimization currently implemented in the Count method avoids the call to the indexer property of the collection?
shojtsy
@shojtsy: The indexer is irrelevant and is never used by any of the LINQ methods, but your concern is valid as the `Count` property and the `GetEnumerator` method can have side effects too. That's why the documentation for the `Count()` method explicitly and clearly singles out `ICollection<T>` and says that it uses the `Count` property if the argument implements that interface instead of enumeration.If this was not clear in the doc, I would have expected `Count()` **not to** use `ICollection<T>.Count` either.
Mehrdad Afshari
Your answer may address this *exact* scenario (`Select` on an `ICollection`) directly, but I feel it is misleading because it could easily be mistakenly interpreted to suggest that this potential optimization were ever under serious consideration, and only dismissed so as to allow side effects. In my answer I'm trying to, first: explain why the Linq extensions *in general* use lazy evaluation; and second: point out that `Select` is *not* special, and works just like any other Linq extension.
Dan Tao
@Dan: I wonder why you may think this is not seriously considered by the team. Yes, LINQ definitely evaluates sequences lazily and `Select` is not special--no question about that, but *it could easily be special*. Your reasoning fails to explain why `Count` for `ICollection<T>` doesn't cause enumeration like on other sequences, it does...
Mehrdad Afshari
... What I'm saying here is that the spec singles it out. Likewise, it could easily single out other stuff as well and it comes down to a *design decision* that draws the line. And **given the current spec**, what removes the possibility of the *potential optimization* in an implementation is the difference it'll cause due to side effects. I'm not saying that this design decision is made because of side effects. Actually side effects are discouraged in that context.
Mehrdad Afshari
@Mehrdad: The more we discuss this the clearer it becomes to me that we read the question differently. I'm not really disagreeing with your argument; I think you're arguing the wrong issue. I've updated my answer to reflect that. In any case, it seems probable that you were correct in your interpretation of the OP's thinking, since yours is the accepted answer.
Dan Tao
+1  A: 

Edit 02-Feb-2010:

As I see it, there are at least two ways to interpret this question.

Why does the Select<T, TResult> extension method, when called on an instance of a class that implements ICollection<T>, not return an object that provides a Count property; and why does the Count<T> extension method not check for this property so as to provide O(1) performance when the two methods are chained?

This version of the question makes no false assumptions about how Linq extensions work, and is a valid question since a call to ICollection<T>.Select.Count will, after all, always return the same value as ICollection<T>.Count. This is how Mehrdad interpreted the question, to which he has provided a thorough response.

But I read the question as asking...

If the Count<T> extension method provides O(1) performance for an object of a class implementing ICollection<T>, why does it provide O(n) performance for the return value of the Select<T, TResult> extension method?

In this version of the question, there is a mistaken assumption: that the Linq extension methods work together by assembling little collections one after another (in memory) and exposing them through the IEnumerable<T> interface.

If this were how the Linq extensions worked, the Select method might look something like this:

public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
    List<TResult> results = new List<TResult>();

    foreach (T input in source)
        results.Add(selector(input));

    return results;
}

Moreover, if this were the implementation of Select, I think you'd find most code that utilizes this method would behave just the same. But it would be wasteful, and would in fact cause exceptions in certain cases like the one I described in my original answer.

In reality, I believe the implementation of the Select method is much closer to something like this:

public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
    foreach (T input in source)
        yield return selector(input);

    yield break;
}

This is to provide lazy evaluation, and explains why a Count property is not accessible in O(1) time to the Count method.

So in other words, whereas Mehrdad answered the question of why Select wasn't designed differently so that Select.Count would behave differently, I have offered my best answer to the question of why Select.Count behaves the way it does.


ORIGINAL ANSWER:

Method side effects is not the answer.

According to Mehrdad's answer:

It doesn't really matter that the result of Select is lazily evaluated.

I don't buy this. Let me explain why.

For starters, consider the following two very similar methods:

public static IEnumerable<double> GetRandomsAsEnumerable(int N) {
    Random r = new Random();

    for (int i = 0; i < N; ++i)
        yield return r.NextDouble();

    yield break;
}

public static double[] GetRandomsAsArray(int N) {
    Random r = new Random();

    double[] values = new double[N];
    for (int i = 0; i < N; ++i)
        values[i] = r.NextDouble();

    return values;
}

OK, what do these methods do? Each one returns as many random doubles as the user desires (up to int.MaxValue). Does it matter whether either method is lazily evaluated or not? To answer this question, let's take a look at the following code:

public static double Invert(double value) {
    return 1.0 / value;
}

public static void Test() {
    int a = GetRandomsAsEnumerable(int.MaxValue).Select(Invert).Count();
    int b = GetRandomsAsArray(int.MaxValue).Select(Invert).Count();
}

Can you guess what will happen with these two method calls? Let me spare you the trouble of copying this code and testing it out yourself:

The first variable, a, will (after a potentially significant amount of time) be initialized to int.MaxValue (currently 2147483647). The second one, b, will very likely be interrupted by an OutOfMemoryException.

Because Select and the other Linq extension methods are lazily evaluated, they allow you to do things you simply could not do otherwise. The above is a fairly trivial example. But my main point is to dispute the assertion that lazy evaluation is not important. Mehrdad's statement that a Count property "is really known from the start and could be optimized" actually begs the question. The issue might seem straightforward for the Select method, but Select is not really special; it returns an IEnumerable<T> just like the rest of the Linq extension methods, and for these methods to "know" the Count of their return values would require full collections to be cached and therefore prohibit lazy evaluation.

Lazy evaluation is the answer.

For this reason, I have to agree with one of the original responders (whose answer now seems to have disappeared) that lazy evaluation really is the answer here. The idea that method side effects need to be accounted for is really secondary, as this is already ensured as a byproduct of lazy evaluation anyway.

Postscript: I've made very assertive statements and emphasized my points mainly because I wanted to be clear on what my argument is, not out of any disrespect for any other responses, including Mehrdad's, which I feel is insightful but misses the mark.

Dan Tao
You don't seem to have read the question. Sure, for generic `IEnumerable<T>`, you'll have to traverse the list. We are specifically talking about `Select` on `ICollection<T>` where we **know** the count beforehand. The library already uses `ICollection<T>.Count` property instead of enumeration. The question is *why not do this for `Select`s on `ICollection<T>` too?*
Mehrdad Afshari
@Mehrdad: The OP said, "I was starting to think that `IEnumerable<T>` can be used like a readonly view of a collection." I've provided what I think is the fundamental reason this is not the case for the return value of any of the Linq extension methods, `Select` or otherwise.
Dan Tao
@Dan: It's important to realize that `IEnumerable<T>` is an *interface*. The return value from `Select` is an object of a type that *implements* `IEnumerable<T>`. As noted in my updated answer, it could provide `Count()` easily. Similarly, `Enumerable.Count` method in the framework takes `IEnumerable<T>` as its *formal parameter* but behaves differently if the argument also implements `ICollection<T>`. I have provided an example where you could still have lazy evaluation, and O(1) count, for `Select`s on `ICollection<T>`s so it's definitely possible.
Mehrdad Afshari
@Mehrdad: I have to admit you did prove me wrong when I said that for the return value of `Select` to provide a `Count` property would prohibit lazy evaluation. But that wasn't really my main point; it sounded to me (and still does) like the OP originally believed the Linq extension methods construct new collections and then wrap them up in the `IEnumerable<T>` interface for public exposure. And obviously, they *could* be implemented this way (`Select` could construct a new `List<T>`, for example); I was trying to steer him away from this misconception by explaining why this is not the case.
Dan Tao
@Dan, thanks for you answer. I didn't had the misconception that IEnumerables are pre-cached. Actually the misconception I had is that they are always lazy, so the suprise was the fast Count() for the special case. It is interesting to look at actual implementation for Select. For example List.Select creates a WhereSelectListIterator object which stores the List in a field, so would be perfectly capable to provide Count info. I now understand that O(1) Count would contradict the specs, although I can not always agree with the spec maker's design decissions.
shojtsy