Hello everybody ::- ). I have some doubts about how enumerators and LINQ work. Consider these two simple selects:

List<Animal> sel = (from animal in Animals 
                    join race in Species
                    on animal.SpeciesKey equals race.SpeciesKey
                    select animal).Distinct().ToList();

or

IEnumerable<Animal> sel = (from animal in Animals 
                           join race in Species
                           on animal.SpeciesKey equals race.SpeciesKey
                           select animal).Distinct();

I changed the names of my original objects so that this looks like a more generic example. The query itself is not that important. What I want to ask is this:

foreach (Animal animal in sel) { /*do stuff*/ }
  1. I noticed that if I use IEnumerable, when I debug, if I inspect "sel", which in that case is the IEnumerable, it has some interesting members: "inner", "outer", "innerKeySelector" and "outerKeySelector", these last 2 appear to be delegates. The "inner" member does not have "Animal" instances in it, but rather "Species" instances, which was very strange for me. The "outer" member does contain "Animal" instances. I presume that the two delegates determine which goes in and what goes out of it?

  2. I noticed that if I use "Distinct", the "inner" contains 6 items (this is incorrect as only 2 are Distinct), but the "outer" does contain the correct values. Again, probably the delegated methods determine this but this is a bit more than I know about IEnumerable.

  3. Most importantly, which of the two options is the best performance-wise?

The evil List conversion via .ToList()?

Or maybe using the enumerator directly?

If you can, please also explain a bit or throw some links that explain this use of IEnumerable.

+11  A: 

The most important thing to realise is that, using LINQ, the query does not get evaluated immediately. It is only run as part of iterating through the resulting IEnumerable<T> in a foreach - that's what all the weird delegates are doing.

So, the first example evaluates the query immediately by calling ToList and putting the query results in a list.
The second example returns an IEnumerable<T> that contains all the information needed to run the query later on.

In terms of performance, the answer is it depends. If you need the results evaluated at once (say, you're mutating the structures you're querying later on, or you don't want the iteration over the IEnumerable<T> to take lots of time) use a list. Else use an IEnumerable<T>. The default should be to use the on-demand evaluation in the second example, as that generally uses less memory, unless there is a specific reason to store the results in a list.
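
A minimal sketch of that difference, using LINQ to Objects on an in-memory list. The `DeferredDemo` class and its numbers are made up for illustration and are not from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Returns the item counts seen through the deferred query and through
    // the list snapshot, after the source list has been changed.
    public static (int Deferred, int Snapshot) Run()
    {
        var numbers = new List<int> { 1, 2, 3 };

        // Deferred: nothing is evaluated here, the query is only described.
        IEnumerable<int> deferred = numbers.Where(n => n > 1);

        // Immediate: the query runs now and the results are copied to a new list.
        List<int> snapshot = numbers.Where(n => n > 1).ToList();

        // Change the source after both queries were defined.
        numbers.Add(4);

        // The deferred query runs only now and sees the new item; the snapshot does not.
        return (deferred.Count(), snapshot.Count);
    }

    public static void Main()
    {
        var result = Run();
        // prints "deferred: 3, snapshot: 2"
        Console.WriteLine($"deferred: {result.Deferred}, snapshot: {result.Snapshot}");
    }
}
```

So if the source may change before you iterate, the list gives you a stable copy; otherwise the deferred version avoids building one.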

thecoop
Hi and thanks for answering ::- ). This cleared up almost all my doubts. Any idea why the Enumerable is "split" into "inner" and "outer"? This happens when I inspect the element in debug/break mode via mouse. Is this perhaps Visual Studio's contribution? Enumerating on the spot and indicating input and output of the Enum?
Axonn
That's the `Join` doing its work - inner and outer are the two sides of the join. Generally, don't worry about what's actually in the `IEnumerables`, as it will be completely different from your actual code. Only worry about the actual output when you iterate over it :)
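
To make that concrete, here is a sketch of roughly what the query from the question becomes in method syntax. The names `outer`, `inner`, `outerKeySelector` and `innerKeySelector` are the parameter names of `Enumerable.Join`; the `Name` property and the sample data are assumptions added for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Species
{
    public int SpeciesKey { get; set; }
    public string Name { get; set; } = "";
}

public class Animal
{
    public int SpeciesKey { get; set; }
    public string Name { get; set; } = "";
}

public static class JoinDemo
{
    public static List<string> Run()
    {
        var speciesList = new List<Species>
        {
            new Species { SpeciesKey = 1, Name = "Cat" },
            new Species { SpeciesKey = 2, Name = "Dog" },
            new Species { SpeciesKey = 3, Name = "Bird" },
        };
        var animals = new List<Animal>
        {
            new Animal { SpeciesKey = 1, Name = "Tom" },
            new Animal { SpeciesKey = 2, Name = "Rex" },
            new Animal { SpeciesKey = 9, Name = "Nemo" }, // no matching species
        };

        // "from animal in animals join race in speciesList on ... select animal"
        IEnumerable<Animal> sel = animals                 // outer
            .Join(
                speciesList,                              // inner (the whole collection, unfiltered)
                animal => animal.SpeciesKey,              // outerKeySelector
                race => race.SpeciesKey,                  // innerKeySelector
                (animal, race) => animal)                 // resultSelector
            .Distinct();

        // Nothing has run yet; the join executes when we enumerate here.
        return sel.Select(a => a.Name).ToList();
    }

    public static void Main()
    {
        // prints "Tom, Rex"
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```

This would also explain why "inner" shows every Species item in the debugger: it is just the second input to the join, stored as-is until the query is enumerated.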
thecoop
A: 

If all you want to do is enumerate them, use the IEnumerable.

Beware, though, that changing the original collection while it is being enumerated is a dangerous operation - in this case, you will want to call ToList first. This enumerates the IEnumerable and creates a new list entry for each element in memory, so it is less performant if you only enumerate once - but it is safer, and sometimes the List methods are handy (for instance, for random access).

Daren Thomas
I'm not sure it's safe to say that generating a list means lower performance.
Steven Sudit
@ Steven: indeed as thecoop and Chris said, sometimes it may be necessary to use a List. In my case, I have concluded it isn't. @ Daren: what do you mean by "this will create a new list for each element in memory"? Maybe you meant a "list entry"? ::- ).
Axonn
@Axonn yes, I meant list entry. fixed.
Daren Thomas
@Steven If you plan to iterate over the elements in the `IEnumerable`, then creating a list first (and iterating over that) means you iterate over the elements *twice*. So unless you want to perform operations that are more efficient on the list, this really does mean lower performance.
Daren Thomas
Assuming we're just going to iterate over all of the results exactly once, there's no advantage to making a list unless (as you said) the operation benefits from random access. Generating the list always costs us *something*. My thought is that, if this is LINQ to SQL or if the processing is not trivial, then caching the results in a list allows us to pay once and then iterate over it as often as we like cheaply. As the overhead of list generation is fairly low, it's not hard to come up with cases where the benefits outweigh that cost. I hope this explains my thinking.
Steven Sudit
@Steven methinks we are arguing the same point ;)
Daren Thomas
Even worse, I think we're in agreement.
Steven Sudit
+2  A: 

The advantage of IEnumerable is deferred execution (usually with databases). The query will not get executed until you actually loop through the data. It's a query waiting until it's needed (aka lazy loading).

If you call ToList, the query will be executed, or "materialized" as I like to say.

There are pros and cons to both. If you call ToList, you may remove some mystery as to when the query gets executed. If you stick to IEnumerable, you get the advantage that the program doesn't do any work until it's actually required.

Matt Sherman
Thanks a lot! ::- ). Any idea about those "inner" / "outer" members of the Enum, as I inspect it via mouse in Visual Studio at debug/break time? Maybe VS iterates through the enum and tells me about input/output by itself?
Axonn
I'm afraid I don't know about those...
Matt Sherman
+1  A: 

A class that implements IEnumerable allows you to use the foreach syntax.

Basically it has a method to get the next item in the collection. It doesn't need the whole collection to be in memory and doesn't know how many items are in it, foreach just keeps getting the next item until it runs out.
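
As a rough sketch of what that means, the two methods below do the same thing: the second spells out approximately what foreach does with the enumerator. The `ForeachDemo` class is made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ForeachDemo
{
    // The usual way: let foreach drive the enumerator.
    public static int SumWithForeach(IEnumerable<int> items)
    {
        int sum = 0;
        foreach (int item in items)
            sum += item;
        return sum;
    }

    // Roughly the same loop written by hand.
    public static int SumByHand(IEnumerable<int> items)
    {
        int sum = 0;
        using (IEnumerator<int> e = items.GetEnumerator())
        {
            // MoveNext advances to the next item and returns false when there are no more.
            while (e.MoveNext())
                sum += e.Current;
        }
        return sum;
    }

    public static void Main()
    {
        var data = new List<int> { 1, 2, 3 };
        // prints "6 6"
        Console.WriteLine($"{SumWithForeach(data)} {SumByHand(data)}");
    }
}
```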

This can be very useful in certain circumstances, for instance in a massive database table you don't want to copy the entire thing into memory before you start processing the rows.

Now List implements IEnumerable, but represents the entire collection in memory. If you have an IEnumerable and you call .ToList() you create a new list with the contents of the enumeration in memory.

Your LINQ expression returns an enumeration, and by default it executes only when you iterate over it with foreach. You can force it to execute sooner by calling .ToList().

Here's what I mean:

var things = 
    from item in BigDatabaseCall()
    where ....
    select item;

// this will iterate through the entire linq statement:
int count = things.Count();

// this will stop after iterating the first one, but will execute the linq again
bool hasAnyRecs = things.Any();

// this will execute the linq statement *again*
foreach( var thing in things ) ...

// this will copy the results to a list in memory
var list = things.ToList();

// this won't iterate through again, the list knows how many items are in it
int count2 = list.Count();

// this won't execute the linq statement - we have it copied to the list
foreach( var thing in list ) ...
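
A runnable variation on the above, as a sketch: `ExpensiveSource` is a hypothetical stand-in for `BigDatabaseCall()`, and the counter records how many times the source is enumerated from the start.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReEnumerationDemo
{
    // Counts how many times the source was enumerated from the start.
    public static int SourceRuns;

    // Hypothetical stand-in for an expensive call such as BigDatabaseCall().
    static IEnumerable<int> ExpensiveSource()
    {
        SourceRuns++; // executes each time a new enumeration begins
        for (int i = 1; i <= 5; i++)
            yield return i;
    }

    public static (int AfterQuery, int AfterList) Run()
    {
        SourceRuns = 0;
        IEnumerable<int> things = ExpensiveSource().Where(n => n % 2 == 1);

        _ = things.Count();                 // enumerates the whole source (run 1)
        _ = things.Any();                   // starts over, stops at the first match (run 2)
        foreach (var thing in things) { }   // starts over again (run 3)
        int afterQuery = SourceRuns;

        List<int> list = things.ToList();   // one more run, results copied to memory (run 4)
        _ = list.Count;                     // no enumeration, the list knows its size
        foreach (var thing in list) { }     // iterates the copy, not the source

        return (afterQuery, SourceRuns);
    }

    public static void Main()
    {
        var result = Run();
        // prints "3 runs before ToList, 4 after"
        Console.WriteLine($"{result.AfterQuery} runs before ToList, {result.AfterList} after");
    }
}
```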
Keith
Thank you for answering and providing all the examples ::- ).
Axonn
+2  A: 

IEnumerable describes behavior, while List is an implementation of that behavior. When you use IEnumerable, you give LINQ a chance to defer work until later, possibly optimizing along the way. If you use ToList() you force the results to be materialized right away.

Whenever I'm "stacking" LINQ expressions, I use IEnumerable, because by only specifying the behavior I give LINQ a chance to defer evaluation and possibly optimize the program. Remember how LINQ doesn't generate the SQL to query the database until you enumerate it? Consider this:

public IEnumerable<Animals> AllSpotted()
{
    return from a in Zoo.Animals
           where a.coat.HasSpots == true
           select a;
}

public IEnumerable<Animals> Feline(IEnumerable<Animals> sample)
{
    return from a in sample
           where a.race.Family == "Felidae"
           select a;
}

public IEnumerable<Animals> Canine(IEnumerable<Animals> sample)
{
    return from a in sample
           where a.race.Family == "Canidae"
           select a;
}

Now you have a method that selects an initial sample ("AllSpotted"), plus some filters. So now you can do this:

var Leopards = Feline(AllSpotted());
var Hyenas = Canine(AllSpotted());

So is it faster to use List over IEnumerable? Only if you want to prevent a query from being executed more than once. But is it better overall? Well, in the above, Leopards and Hyenas can each be converted into a single SQL query, so the database only returns the rows that are relevant. Note that this only holds if the query stays typed as IQueryable<Animals> all the way through; once it is passed around as IEnumerable<Animals>, the later filters run in memory via LINQ to Objects. If we had returned a List from AllSpotted(), it may run slower because the database could return far more data than is actually needed, and we waste cycles doing the filtering in the client.

In a program, it may be better to defer converting your query to a list until the very end, so if I'm going to enumerate through Leopards and Hyenas more than once, I'd do this:

List<Animals> Leopards = Feline(AllSpotted()).ToList();
List<Animals> Hyenas = Canine(AllSpotted()).ToList();
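
One caveat worth illustrating: whether a stacked filter can still reach the database depends on the static type it is written against. The sketch below uses `AsQueryable()` on an in-memory list as a stand-in for a database table, and a simplified `Animal` class with hypothetical `Family` and `HasSpots` properties, so it shows the binding rather than real SQL generation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Animal
{
    public string Family { get; set; } = "";
    public bool HasSpots { get; set; }
}

public static class QueryableDemo
{
    // Typed against IQueryable<T>: binds to Queryable.Where, which extends the
    // expression tree that a LINQ provider can translate as a whole.
    public static IQueryable<Animal> FelineQueryable(IQueryable<Animal> sample)
    {
        return from a in sample
               where a.Family == "Felidae"
               select a;
    }

    // Same filter typed against IEnumerable<T>: binds to Enumerable.Where,
    // so this part runs in memory with LINQ to Objects.
    public static IEnumerable<Animal> FelineEnumerable(IEnumerable<Animal> sample)
    {
        return from a in sample
               where a.Family == "Felidae"
               select a;
    }

    public static (bool Composable, bool InMemory) Run()
    {
        var zoo = new List<Animal>
        {
            new Animal { Family = "Felidae", HasSpots = true },
            new Animal { Family = "Canidae", HasSpots = true },
            new Animal { Family = "Felidae", HasSpots = false },
        };

        // Stand-in for a database table (assumption for this sketch).
        IQueryable<Animal> spotted = zoo.AsQueryable().Where(x => x.HasSpots);

        bool composable = FelineQueryable(spotted) is IQueryable<Animal>;
        bool inMemory = !(FelineEnumerable(spotted) is IQueryable<Animal>);
        return (composable, inMemory);
    }

    public static void Main()
    {
        var result = Run();
        // prints "True True"
        Console.WriteLine($"{result.Composable} {result.InMemory}");
    }
}
```

With a real provider such as LINQ to SQL, the first form is the one that can end up as a single query with both conditions in it.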
C. Lawrence Wenham
Hi and thanks for answering ::- ). You gave me a very good example of a case where the IEnumerable approach clearly has the performance advantage. Any idea regarding that other part of my question? Why is the Enumerable "split" into "inner" and "outer"? This happens when I inspect the element in debug/break mode via mouse. Is this perhaps Visual Studio's contribution? Enumerating on the spot and indicating input and output of the Enum?
Axonn
I think they refer to the two sides of a join. If you do "SELECT * FROM Animals JOIN Species..." then the outer part of the join is Animals, and the inner part is Species.
C. Lawrence Wenham
Oooh. Right ::- D. Probably so. Thank you again.
Axonn