views:

205

answers:

7

I am trying to get to grips with LINQ. The thing that bothers me most is that even as I understand the syntax better, I don't want to unwittingly sacrifice performance for expressiveness.

Are they any good centralized repositories of information or books for 'Effective LINQ' ? Failing that, what is your own personal favourite high-performance LINQ technique ?

I am primarily concerned with LINQ to Objects, but all suggestions on LINQ to SQL and LINQ to XML also welcome of course. Thanks.

A: 

In linq to SQL you don't need to care that much about performance. you can chain all your statements in the way you think it is the most readable. Linq just translates all your statements into 1 SQL statement in the end, which only gets called/executed in the end (like when you call a .ToList()

a var can contain this statement without executing it if you want to apply various extra statements in different conditions. The executing in the end only happens when you want to translate your statements into a result like an object or a list of objects.

Stefanvds
This is only for LINQ to SQL.
nlawalker
how was I supposed to know if steve didn't specify...
Stefanvds
He did. See the last sentence of the first revision.
Rex M
@Stefanvds - that's fine, just edit this into your response
Steve Townsend
-1. Sometimes LinqToSql will translate your query into n+1 statements. Sometimes the query produced is less than optimal. LinqToSql queries require database performance analysis, no different than any other query from any other source really.
David B
+3  A: 

I think the most important thing to understand with LINQ performance is that your queries are never executed until you actually iterate over the results.

var someList = from p in People
               where p.Age >= 21
               select p;

That statement by itself isn't going to run the query. When you do something like:

 someList.ToList()

OR

foreach( var person in someList )

This is when your query will execute...and it will execute every time you write a statement like that.

Dismissile
+4  A: 

There are various factors which will affect performance.

Often, developing a solution using LINQ will offer pretty reasonable performance because the system can build an expression tree to represent the query without actually running the query while it builds this. Only when you iterate over the results does it use this expression tree to generate and run a query.

In terms of absolute efficiency, running against predefined stored procedures you may see some performance hit, but generally the approach to take is to develop a solution using a system that offers reasonable performance (such as LINQ), and not worry about a few percent loss of performance. If a query is then running slowly, then perhaps you look at optimisation.

The reality is that the majority of queries will not have the slightest problem with being done via LINQ. The other fact is that if your query is running slowly, it's probably more likely to be issues with indexing, structure, etc, than with the query itself, so even when looking to optimise things you'll often not touch the LINQ, just the database structure it's working against.

For handling XML, if you've got a document being loaded and parsed into memory (like anything based on the DOM model, or an XmlDocument or whatever), then you'll get more memory usage than systems that do someting like raising events to indicate finding a start or end tag, but not building a complete in-memory version of the document (like SAX or XmlReader). The downside is that the event-based processing is generally rather more complex. Again, with most documents there won't be a problem - most systems have several GB of RAM, so taking up a few MB representing a single XML document is not a problem (and you often process a large set of XML documents at least somewhat sequentially). It's only if you have a huge XML file that would take up 100's of MB that you worry about the particular choice.

Bear in mind that LINQ allows you to iterate over in-memory lists and so on as well, so in some situations (like where you're going to use a set of results again and again in a function), you may use .ToList or .ToArray to return the results. Sometimes this can be useful, although generally you want to try to use the database's querying rather in-memory.

As for personal favourites - NHibernate LINQ - it's an object-relational mapping tool that allows you to define classes, define mapping details, and then get it to generate the database from your classes rather than the other way round, and the LINQ support is pretty good (certainly better than the likes of SubSonic).

David Burton
A: 

There's a codeplex project called i4o which I used a while back which can help improve the performance of Linq to Objects in cases where you're doing equality comparisons, e.g.

from p in People 
where p.Age == 21 
select p;

http://i4o.codeplex.com/ I haven't tested it with .Net 4 so can't safely say it will still work but worth checking out. To get it to work its magic you mostly just have to decorate your class with some attributes to specify which property should be indexed. When I used it before it only works with equality comparisons though.

theburningmonk
I've found that in general if I'm doing something with LINQ that needs indexes to speed it up it's usually better to do the indexes myself since a) I may want them for other purposes and b) I control how the indexes are managed. Also this allows me to create much more comple indexes
RobV
+2  A: 

Simply understanding what LINQ is doing internally should yield enough information to know whether you are taking a performance hit.

Here is a simple example where LINQ helps performance. Consider this typical old-school approach:

List<Foo> foos = GetSomeFoos();
List<Foo> filteredFoos = new List<Foo>();
foreach(Foo foo in foos)
{
    if(foo.SomeProperty == "somevalue")
    {
        filteredFoos.Add(foo);
    }
}
myRepeater.DataSource = filteredFoos;
myRepeater.DataBind();

So the above code will iterate twice and allocate a second container to hold the filtered values. What a waste! Compare with:

var foos = GetSomeFoos();
var filteredFoos = foos.Where(foo => foo.SomeProperty == "somevalue");
myRepeater.DataSource = filteredFoos;
myRepeater.DataBind();

This only iterates once (when the repeater is bound); it only ever uses the original container; filteredFoos is just an intermediate enumerator. And if, for some reason, you decide not to bind the repeater later on, nothing is wasted. You don't even iterate or evaluate once.

When you get into very complex sequence manipulations, you can potentially gain a lot by leveraging LINQ's inherent use of chaining and lazy evaluation. Again, as with anything, it's just a matter of understanding what it is actually doing.

Rex M
Thanks, this is an example of exactly what I was hoping to see. I just wish it was easier to learn by example than trial and error or *a priori* analysis.
Steve Townsend
+2  A: 

Linq, as a built-in technology, has performance advantages and disadvantages. The code behind the extension methods has had considerable performance attention paid to it by the .NET team, and its ability to provide lazy evaluation means that the cost of performing most manipulations on a set of objects is spread across the larger algorithm requiring the manipulated set. However, there are some things you need to know that can make or break your code's performance.

First and foremost, Linq doesn't magically save your program the time or memory needed to perform an operation; it just may delay those operations until absolutely needed. OrderBy() performs a QuickSort, which will take nlogn time just the same as if you'd written your own QuickSorter or used List.Sort() at the right time. So, always be mindful of what you're asking Linq to do to a series when writing queries; if a manipulation is not necessary, look to restructure the query or method chain to avoid it.

By the same token, certain operations (sorting, grouping, aggregates) require knowledge of the entire set they are acting upon. The very last element in a series could be the first one the operation must return from its iterator. On top of that, because Linq operations should not alter their source enumerable, but many of the algorithms they use will (i.e. in-place sorts), these operations end up not only evaluating, but copying the entire enumerable into a concrete, finite structure, performing the operation, and yielding through it. So, when you use OrderBy() in a statement, and you ask for an element from the end result, EVERYTHING that the IEnumerable given to it can produce is evaluated, stored in memory as an array, sorted, then returned one element at a time. The moral is, any operation that needs a finite set instead of an enumerable should be placed as late in the query as possible, allowing for other operations like Where() and Select() to reduce the cardinality and memory footprint of the source set.

Lastly, Linq methods drastically increase the call stack size and memory footprint of your system. Each operation that must know of the entire set keeps the entire source set in memory until the last element has been iterated, and the evaluation of each element will involve a call stack at least twice as deep as the number of methods in your chain or clauses in your inline statement (a call to each iterator's MoveNext() or yielding GetEnumerator, plus at least one call to each lambda along the way). This is simply going to result in a larger, slower algorithm than an intelligently-engineered inline algorithm that performs the same manipulations. Linq's main advantage is code simplicity. Creating, then sorting, a dictionary of lists of groups values is not very easy-to-understand code (trust me). Micro-optimizations can obfuscate it further. If performance is your primary concern, then don't use Linq; it will add approximately 10% time overhead and several times the memory overhead of manipulating a list in-place yourself. However, maintainability is usually the primary concern of developers, and Linq DEFINITELY helps there.

On the performance kick: If performance of your algorithm is the sacred, uncompromisable first priority, you'd be programming in an unmanaged language like C++; .NET is going to be much slower just by virtue of it being a managed runtime environment, with JIT native compilation, managed memory and extra system threads. I would adopt a philosophy of it being "good enough"; Linq may introduce slowdowns by its nature, but if you can't tell the difference, and your client can't tell the difference, then for all practical purposes there is no difference. "Premature optimization is the root of all evil"; Make it work, THEN look for opportunities to make it more performant, until you and your client agree it's good enough. It could always be "better", but unless you want to be hand-packing machine code, you'll find a point short of that at which you can declare victory and move on.

KeithS