views:

316

answers:

6

I am using some of the LINQ select stuff to create some collections, which return IEnumerable<T>.

In my case I need a List<T>, so I am passing the result to List<T>'s constructor to create one.

I am wondering about the overhead of doing this. The items in my collections are usually in the millions, so I need to consider this.

I assume, if the IEnumerable<T> contains ValueTypes, it's the worst performance.

Am I right? What about Ref Types? Either way there is also the cost of calling, List<T>.Add a million times, right?

Any way to solve this? Like can I "overload" methods like LINQ Select using extension methods)?

+6  A: 

It would be best to avoid the need for a list. If you can keep your caller using IEnumerable<T>, you will save yourself some headaches.

LINQ's ToList() will take your enumerable, and just construct a new List<T> directly from it, using the List<T>(IEnumerable<T>) constructor. This will be the same as making the list yourself, performance wise (although LINQ does a null check, as well).

If you're adding the elements yourself, use the AddRange method instead of the Add. ToList() is very similar to AddRange (since it's using the constructor which takes IEnumerable<T>), which typically will be your best bet, performance wise, in this case.

Reed Copsey
Good explaination.
Justin Niessner
Thanks Reed. When you said "If your IEnumerable<T> is already an IList<T>", you mean there are some methods in LINQ that returns IList instead?
Joan Venge
ToList should always return an independent copy of the data I believe - and IIRC t check right now as I'm on a phone and MSDN isn't great on this size of screen.
Jon Skeet
Rats, half that message got cut. It was meant to say that IIRC Enumerable.ToList *is* typed to return List<T>, not IList<T>.
Jon Skeet
@Jon, you are right:public static List<TSource> ToList<TSource>( this IEnumerable<TSource> source)
Joan Venge
@Jon: Thanks for pointing that out. I always forget, it does do a full construction, unless there is already a list there. I just checked, and will modify my answer.
Reed Copsey
What do you mean by "unless there is already a list there"? I'd expect Enumerable.ToList to *always* create a new, independent copy of the source data *regardless* of the type of the original sequence.
Jon Skeet
@Jon Skeet: But it uses List<T>(IEnumerable) to construct the list, which checks for ICollection, and makes shortcuts (the same ones you suggested, where you know the size in advance). If it's already a list, it uses the count to preallocate the array at exactly the right number of items, then uses CopyTo to copy the references (or values) across in one block. If it's not a list, it enumerates the elements, which is much slower. This is part of the nice thing about using new List<T>(enumerable), vs. just doing lots of Add calls.
Reed Copsey
That confused me too. Would the compiler know if your original sequence was List vs Array from the resulting IEnumerable?
Joan Venge
@Joan Venge: No. It's a runtime check. The List<T> constructor which takes an IEnumerable does a cast to ICollection<T> (ie: ICollection<T> coll = source as ICollection<T>), and if it's an ICollection<T>, it uses the Count property directly to preallocate, then uses CopyTo instead of enumerating all of the elements. This potentially makes it faster if the source is a list (or any other ICollection<T> implementation).
Reed Copsey
Thanks Reed. Now I got it.
Joan Venge
+1  A: 

Don't pass an IEnumerable to the List constructor. IEnumerable has a ToList() method, which can't possibly do worse than that, and has nicer syntax (IMHO).

That said, that only changes the answer to your question to "it depends" - in particular, it depends on what the IEnumerable actually is behind the scenes. If it happens to be a List already, then ToList will effectively be free, of course will go much faster than if it were another type. It's still not super-fast.

The best way to solve this, of course, is to try to figure out how to do your processing on an IEnumerable rather than a List. That may not be possible.


Edit: Some people in the comments are debating whether or not ToList() will actually be any faster when called on a List than if not, and whether ToList() will be any faster than the list constructor. At this point, speculating is getting pointless, so here's some code:

using System;
using System.Linq;
using System.Collections.Generic;

public static class ToListTest
{
    public static int Main(string[] args)
    {
        List<int> intlist = new List<int>();
        for (int i = 0; i < 1000000; i++)
            intlist.Add(i);

        IEnumerable<int> intenum = intlist;

        for (int i = 0; i < 1000; i++)
        {
            List<int> foo = intenum.ToList();
        }

        return 0;
    }
}

Running this code with an IEnumerable that's really a List goes about 6-10 times faster than if I replace it with a LinkedList or Stack (on my pokey 2.4 GHz P4, using Mono 1.2.6). Conceivably this could be due to some unfortunate interaction between ToList() and the particular implementations of LinkedList or Stack's enumerations, but at least the point remains: speed will depend on the underlying type of the IEnumerable. That said, even with a List as the source, it still takes 6 seconds for me to make 1000 ToList() calls, so it's far from free.

The next question is whether ToList() is any more intelligent than the List constructor. The answer to that turns out to be no: the List constructor is just as fast as ToList(). In hindsight, Jon Skeet's reasoning makes sense - I was just forgetting that ToList() was an extension method. I still (much) prefer ToList() syntactically, but there's no performance reason to use it.

So the short version is that the best answer is still "don't convert to a List if you can avoid it". Barring that, actual performance will depend drastically on what the IEnumerable actually is, but at best it'll be sluggish, as opposed to glacial. I've amended my original answer to reflect this.

C Pirate
IEnumerable<T> doesn't have ToList in itself - it's just that there's an extension method to do the conversion. That extension method can't have any more (or less) information than the List<T> constructor, so I fail to see any reason to suppose it would perform better or worse.
Jon Skeet
(And I don't believe that calling the ToList extension method on an existing List<T> will be free. It shouldn't be, because it should create an *independent copy* of the sequence's data.)
Jon Skeet
@Jon: in contrast to the constructor which MUST copy all items into the list, ToList can try to cast the IList<T> first and only actually create a new list if really required. Therefore, depending on that that IEnumerable<T> instance is, it may avoid copying anything altogether.
Lucero
@Lucero: I disagree. I'll have to check the docs, but I believe all the ToXXX methods must always create an independent copy of the sequence in its current state. Otherwise you'd have a very odd situation where *sometimes* changes to the original would be seen in the result, and sometimes not. I'm 99% sure the current implementation will always create a copy, at the very least.
Jon Skeet
@C Pirate: Btw when you said "depends on the IEnumerable instance", what do you mean? I mean if I filter a List<T> using Select, then the ToList method be free?
Joan Venge
I've just checked, and the docs aren't as clear as I'd like them to be - but ToList *is* declared to return List<T>, not just IList<T>. I also *strongly* believe that any implementation which *doesn't* create an independent copy would lead to widespread confusion due to the inconsistency.
Jon Skeet
Thanks Jon, by independent copy you mean deep copy?
Joan Venge
No, I mean that changing one list (e.g. to remove an item) won't affect the other.
Jon Skeet
+6  A: 

No, there's no particular penalty for the element type being value types, assuming you're using IEnumerable<T> instead of IEnumerable. You won't get any boxing going on.

If you actually know the size of the result beforehand (which the result of Select probably won't) you might want to consider creating the list with that size of buffer, then using AddRange to add the values. Otherwise the list will have to resize its buffer every time it fills it.

For instance, instead of doing:

Foo[] foo = new Foo[100];
IEnumerable<string> query = foo.Select(foo => foo.Name);
List<string> queryList = new List<string>(query);

you might do:

Foo[] foo = new Foo[100];
IEnumerable<string> query = foo.Select(x => x.Name);
List<string> queryList = new List<string>(foo.Length);
queryList.AddRange(query);

You know that calling Select will produce a sequence of the same length as the original query source, but nothing in the execution environment has that information as far as I'm aware.

Jon Skeet
Downvoters: please provide reasons...
Jon Skeet
Thanks Jon. For valuetypes, I thought they might be copied again for each element into the new list. Is it not the case? Also can you please explain your last sentence. I didn't get it. What about using "from i in collection where ... Select ?" It's the same, right?
Joan Venge
I upvoted btw Jon.
Joan Venge
Whether it's references, integers, floating point values or anything else, all the values have to be copied. If you're filtering in your query then you *don't* know the final size of the sequence, so I'd stick with your current code or call ToList. (Unless the filter is likely to only 'remove' a small proportion of entries.)
Jon Skeet
Thanks Jon, can you please tell me why refs are copied in this case? I figured if I modify an item in my 2nd list, the item in the first list would also be modified. Is it because the implementation is different in Select?
Joan Venge
The list *itself* wouldn't be modified just because you changed the contents of an object referred to by a reference within the list. Put it this way - if someone has a list of house addresses, does *that list* change if someone adds some furniture to a house?
Jon Skeet
Thanks Jon. I might have asked it wrong. But if I change the item in the resulting collection like: result[10].X = 5, would original[10].X changes too if T was RefType? If so, I don't want that :)
Joan Venge
@Joan: Yes, it will. If you're working with a reference type, it's creating a new List of references, but they still point to the same original objects. If you want different behavior than that, you'll need to explicitly replace the full object in your new list. Ie: result[10] = new ResultType(5, result[10].Y, result[10].Z, ...); Alternatively, you'd need to do some form of deep copy operation where you clone the elements instead of just copying references to them.
Reed Copsey
Thanks Reed. In that case, using Select on ref types vs value types would have different performance? (for copying)
Joan Venge
@Joan Venge: No. The two are pretty much identical. If you need a full, deep copy, you'll need to implement your own deep copying. With a reference type, that's going to mean 1 million constructor calls... Using Select isn't going to really change anything vs. just doing those directly.
Reed Copsey
Thanks Reed. But in the case of value types, the Select method will still just copy pointers and not the value itself? I thought value types would be brand new copies in this case.
Joan Venge
No, in every case the values in the sequence are copied - for value types, those values are the actual data (numbers etc). For reference types they're references. See http://pobox.com/~skeet/csharp/references.html
Jon Skeet
Thanks Jon, that's what I thought.
Joan Venge
+1  A: 

Generally speaking, a method returning IEnumerable doesn't have to evaluate any of the items before the item is actually needed. So, theoretically, when you return an IEnumerable none of you items need to exist at that time.

So creating a list means that you will really need to evaluate items, get them and place them somewhere in memory (at least their references). There is nothing that can be done about this - if you really need to have a list.

Groo
+1  A: 

A number of other responders have already provided ideas for how to improve the performance of copying an IEnumerable<T> into a List<T> - I don't think that much can be added on that front.

However, based on what you have described you need to do with the results, and the fact that you get rid of the list when you're done (which I presume means that the intermediate results are not interesting) - you may want to consider whether you really need to materialize a List<T>.

Rather than creating a List<T> and operating on the contents of that list - consider writing a lazy extension method for IEnumerable<T> that performs the same processing logic. I've done this myself in a number of cases, and writing such logic in C# is not so bad when using the [yield return][1] syntax supported by the compiler.

This approach works well if all you're trying to do is visit each item in the results and collection some information from it. Often, what you need to do is just visit each element in the collection on demand, do some processing with it, and then move on. This approach is generally more scalable and performant that creating a copy of the collection just to iterate over it.

Now, this advice may not work for you for other reasons, but it's worth considering as an alternative to finding the most efficient way to materialize a very large list.

LBushkin
+1  A: 

From reading the various comments and the question I get the following requirements

for a collection of data you need to run through that collection, filter out some objects and then perform some transformation on the remaining objects. If thats the case you can do something like this:

var result = from item in collection
             where item.Id > 10 //or some more sensible condition
             select Operation(item);

and if you need to the perform more filtering and transformation you can nest your LINQ queries like

var result = from filteredItem in (from item in collection
                                  where item.Id > 10 //or some more sensible condition
                                  select Operation(item))
                 where filteredItem.SomePropertyAvailableAfterFirstTransformation == "new"
                 select SecondTransfomation(filteredItem);
Rune FS
Thanks in your second example. Would it be faster if the nested Select was outside the outer Select?So like:collection = ... select Operation(item)from filtereditem in collection ...?
Joan Venge
The order of the queries will depend on the work done in each, the more work done early on the better. This is basically the rule of moving work from inner loops to outer loops when ever possible. (note: nested queries are not nested loops but the logic still applies since the list of elements passed to the second query will be limited by the where cluase in the first query)
Rune FS