Edit 02-Feb-2010:
As I see it, there are at least two ways to interpret this question.
Why does the Select<T,
TResult>
extension method, when
called on an instance of a class that
implements ICollection<T>
, not
return an object that provides a
Count
property; and why does the
Count<T>
extension method not
check for this property so as to
provide O(1) performance when the two
methods are chained?
This version of the question makes no false assumptions about how Linq extensions work, and is a valid question since a call to ICollection<T>.Select.Count
will, after all, always return the same value as ICollection<T>.Count
. This is how Mehrdad interpreted the question, to which he has provided a thorough response.
But I read the question as asking...
If the Count<T>
extension method provides O(1)
performance for an object of a class
implementing ICollection<T>
, why
does it provide O(n) performance for
the return value of the
Select<T, TResult>
extension method?
In this version of the question, there is a mistaken assumption: that the Linq extension methods work together by assembling little collections one after another (in memory) and exposing them through the IEnumerable<T>
interface.
If this were how the Linq extensions worked, the Select
method might look something like this:
public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
List<TResult> results = new List<TResult>();
foreach (T input in source)
results.Add(selector(input));
return results;
}
Moreover, if this were the implementation of Select
, I think you'd find most code that utilizes this method would behave just the same. But it would be wasteful, and would in fact cause exceptions in certain cases like the one I described in my original answer.
In reality, I believe the implementation of the Select
method is much closer to something like this:
public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
foreach (T input in source)
yield return selector(input);
yield break;
}
This is to provide lazy evaluation, and explains why a Count
property is not accessible in O(1) time to the Count
method.
So in other words, whereas Mehrdad answered the question of why Select
wasn't designed differently so that Select.Count
would behave differently, I have offered my best answer to the question of why Select.Count
behaves the way it does.
ORIGINAL ANSWER:
Method side effects is not the answer.
According to Mehrdad's answer:
It doesn't really matter that the
result of Select is lazily evaluated.
I don't buy this. Let me explain why.
For starters, consider the following two very similar methods:
public static IEnumerable<double> GetRandomsAsEnumerable(int N) {
Random r = new Random();
for (int i = 0; i < N; ++i)
yield return r.NextDouble();
yield break;
}
public static double[] GetRandomsAsArray(int N) {
Random r = new Random();
double[] values = new double[N];
for (int i = 0; i < N; ++i)
values[i] = r.NextDouble();
return values;
}
OK, what do these methods do? Each one returns as many random doubles as the user desires (up to int.MaxValue
). Does it matter whether either method is lazily evaluated or not? To answer this question, let's take a look at the following code:
public static double Invert(double value) {
return 1.0 / value;
}
public static void Test() {
int a = GetRandomsAsEnumerable(int.MaxValue).Select(Invert).Count();
int b = GetRandomsAsArray(int.MaxValue).Select(Invert).Count();
}
Can you guess what will happen with these two method calls? Let me spare you the trouble of copying this code and testing it out yourself:
The first variable, a
, will (after a potentially significant amount of time) be initialized to int.MaxValue
(currently 2147483647). The second one, b
, will very likely be interrupted by an OutOfMemoryException
.
Because Select
and the other Linq extension methods are lazily evaluated, they allow you to do things you simply could not do otherwise. The above is a fairly trivial example. But my main point is to dispute the assertion that lazy evaluation is not important. Mehrdad's statement that a Count
property "is really known from the start and could be optimized" actually begs the question. The issue might seem straightforward for the Select
method, but Select
is not really special; it returns an IEnumerable<T>
just like the rest of the Linq extension methods, and for these methods to "know" the Count
of their return values would require full collections to be cached and therefore prohibit lazy evaluation.
Lazy evaluation is the answer.
For this reason, I have to agree with one of the original responders (whose answer now seems to have disappeared) that lazy evaluation really is the answer here. The idea that method side effects need to be accounted for is really secondary, as this is already ensured as a byproduct of lazy evaluation anyway.
Postscript: I've made very assertive statements and emphasized my points mainly because I wanted to be clear on what my argument is, not out of any disrespect for any other responses, including Mehrdad's, which I feel is insightful but misses the mark.