views: 105

answers: 2
I was reading an article about how query expressions defer execution. Does that mean that when we have a collection like:

IEnumerable<int> collection = from i in integers where i % 2 == 0 select i;

is it going to be recalculated every time the collection is accessed?

If so, what's the general practice for dealing with this? Converting it into a new collection?

Also, why did the C# designers choose it this way, and not something where it caches the result in the collection after the first access?

Also, how does the runtime know that collection behaves this way (defers execution), unlike another IEnumerable<T> I might create using a List<T>, which doesn't defer execution?


Edit:

What about cases like this:

List<int> ints = new List<int>() { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 };

var even = from i in ints where i % 2 == 0 select i;

ints.AddRange(new List<int>() { 10, 20, 30, 40, 50 });

foreach (int i in even)
{
    Console.WriteLine(i);
}


Output:

2, 4, 6, 8, 0, 10, 20, 30, 40, 50

With caching, would the behavior be closer to what I expected?

+2  A: 

Yes, it will be recalculated every time.

If you want to cache it, use ToArray() on the result (or ToList() if you want to add new items later).
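For illustration, a minimal sketch of the difference (the names and data are illustrative, not from the question):

using System;
using System.Collections.Generic;
using System.Linq;

class CachingExample
{
    static void Main()
    {
        var ints = new List<int> { 1, 2, 3, 4 };

        var deferred = ints.Where(i => i % 2 == 0);          // re-evaluated on every enumeration
        var cached   = ints.Where(i => i % 2 == 0).ToList(); // evaluated once, right here

        ints.Add(6);

        Console.WriteLine(string.Join(", ", deferred)); // 2, 4, 6 - sees the new element
        Console.WriteLine(string.Join(", ", cached));   // 2, 4    - snapshot taken before the change
    }
}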

The reason it does not cache the result after the first enumeration is that it's fairly tricky to do right in a thread-safe way. Normally, enumerating a collection is considered a thread-safe operation. However, if the implementation cached items as they're being enumerated, it would either become unsafe for multiple readers (surprising to users and highly undesirable), or have to do explicit locking on the cache on every step, which significantly affects performance. And that performance penalty would be there even when you don't actually have several threads reading the collection at the same time - locks are inherently expensive.

Another point is that caching behavior is not very obvious when the query expression references mutable variables that are changed between the point where the query was declared and the point where the result is enumerated. When you do eager evaluation, the values are what they were at the query point. When you do non-caching deferred evaluation, the values are always what they are at the enumeration point. With caching, you do not know - it depends on whether it's the first time you enumerate it or not.
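A small sketch of that point (threshold is just an illustrative variable name):

using System;
using System.Linq;

class DeferredCaptureExample
{
    static void Main()
    {
        int[] numbers = { 1, 2, 3, 4, 5 };
        int threshold = 3;

        var big = numbers.Where(n => n > threshold); // nothing executes yet

        threshold = 4; // mutate the captured variable before enumerating

        Console.WriteLine(string.Join(", ", big)); // 5 - uses threshold == 4 at enumeration time

        // Eager evaluation at the query point would have produced 4, 5;
        // with caching, the result would depend on when the first enumeration happened.
    }
}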

Yet another point is that the implementation cannot really know whether caching is needed or not. If you're only going to enumerate the sequence once (which is a very frequent case), caching items wastes both time and memory.

F# actually has a standard type for such a lazy caching sequence, called LazyList - you can wrap any IEnumerable in it to get the desired semantics, with all the caveats described above.
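There is no equivalent type in the C# standard library, but the idea can be sketched by hand. The CachedSequence<T> below is a made-up name for a deliberately simplified, non-thread-safe illustration of a lazy caching sequence, not a framework type:

using System.Collections;
using System.Collections.Generic;

sealed class CachedSequence<T> : IEnumerable<T>
{
    private readonly IEnumerator<T> _source;
    private readonly List<T> _cache = new List<T>();

    public CachedSequence(IEnumerable<T> source)
    {
        _source = source.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        int index = 0;
        while (true)
        {
            if (index < _cache.Count)
            {
                yield return _cache[index++];    // replay an item we've already seen
            }
            else if (_source.MoveNext())
            {
                _cache.Add(_source.Current);     // pull one more item from the source and cache it
                yield return _cache[index++];
            }
            else
            {
                yield break;                     // source exhausted
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

A real implementation would have to coordinate every enumerator that reaches the end of the cache - exactly the locking cost described above.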

Pavel Minaev
Thanks Pavel. In my edit, is this what you meant in your 4th paragraph?
Joan Venge
I'm not sure. Looking at your edit, it seems that by "caching" you don't actually mean "caching values as they are enumerated", but rather "fully executing the query at the point it is written and storing the result". The paragraph you refer to in my reply talks about the former case, not the latter.
Pavel Minaev
+2  A: 

I find it easier to think of an IEnumerable<T> as a sequence rather than a collection, which is the terminology F# uses. Fundamentally, all an IEnumerable promises is that it can return an IEnumerator, which itself provides a simple contract that happens to have a fancy language construct to make it easy to use: foreach.

So instead of thinking of LINQ as filtering your collection, think of the methods as returning sequences that will meet your criteria when they are enumerated. When your query is compiled into this...

IEnumerable<int> collection = integers.Where(i => i % 2 == 0);

...think of collection as a sequence of the values in integers that are even, rather than a "collection" of such integers. I would even rename collection to something more accurate like evenIntegers.
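For example, here is roughly how foreach consumes such a sequence through the IEnumerator contract (a sketch; evenIntegers and the sample data are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class ForeachLoweringExample
{
    static void Main()
    {
        int[] integers = { 1, 2, 3, 4, 5, 6 };
        IEnumerable<int> evenIntegers = integers.Where(i => i % 2 == 0);

        // foreach (int i in evenIntegers) Console.WriteLine(i);
        // is roughly equivalent to:
        using (IEnumerator<int> e = evenIntegers.GetEnumerator())
        {
            while (e.MoveNext())
            {
                int i = e.Current;
                Console.WriteLine(i); // 2, 4, 6
            }
        }
    }
}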

To answer your specific questions:

  1. Is it going to be recalculated every time the collection is accessed?

    Each call to collection.GetEnumerator() will return a new enumerator, yes. In fact, if you enumerate collection and then update integers, enumerating collection again will produce a different result. And calling additional deferred LINQ operators would just return new sequences that operate on the filtered collection sequence, again without actually calculating anything until they are enumerated.

  2. If so, what's the general practice for dealing with this? Converting it into a new collection?

    In general, you should defer execution as long as possible. By chaining various LINQ methods, it's possible to build up a really "smart" sequence that won't actually do anything until you use foreach (or First, Last, an aggregation like Count, etc.) - see the sketch after this list.

  3. Also, why did the C# designers choose it this way, and not something where it caches the result in the collection after the first access?

    The various immediate operators (ToArray, ToList, ToDictionary, ToLookup) were provided to support scenarios where an in-memory collection is needed. It's easy to cache a deferred sequence; it's impossible to un-defer a cached sequence.

  4. Also, how does the runtime know that collection behaves this way (defers execution), unlike another IEnumerable I might create using a List, which doesn't defer execution?

    As I suggested before, the runtime doesn't really know anything about how a particular IEnumerable<T> is going to behave. It just knows that it will provide an enumerator. It's up to the individual implementer (like List<T>) to decide how it wants to behave.
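As mentioned in point 2, here is a sketch of chaining deferred operators and materializing only at the end (the data and names are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class ChainingExample
{
    static void Main()
    {
        IEnumerable<int> integers = Enumerable.Range(1, 100);

        // Nothing below executes yet; each call just wraps the previous sequence.
        var query = integers
            .Where(i => i % 2 == 0)
            .Select(i => i * i)
            .Take(5);

        // Execution happens here, once; ToList caches the result in memory.
        List<int> results = query.ToList();

        Console.WriteLine(string.Join(", ", results)); // 4, 16, 36, 64, 100
    }
}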

dahlbyk
For an interesting example of a sequence that's not a "real" collection, check out Continuous LINQ: http://www.codeplex.com/clinq
dahlbyk