views: 91

answers: 1
I have created a single LINQ query that builds a main group and then two nested groups; within the innermost group there is also a simple OrderBy. The issue I am running into is that while writing or editing the query, Visual Studio's memory consumption skyrockets to ~500MB and it eats 50% of my CPU, which makes Visual Studio unresponsive for a few minutes. If I comment out the query, Visual Studio acts just fine. So my question is: why is Visual Studio consuming so much memory at design time for a LINQ query, granted it is kind of complicated?

The DataTable I am using is 10,732 rows by 21 columns.

var results = from p in m_Scores.AsEnumerable()
              group p by p.Field<string>("name") into x
              select new
              {
                  Name = x.Key,
                  Members = from z in x
                            group z by z.Field<string>("id") into zz
                            select new
                            {
                                Id = zz.Key,
                                Plots = from a in zz
                                        group a by a.Field<string>("foo") into bb
                                        select new
                                        {
                                            Foo = bb.Key,
                                            Bars = bb
                                        }.Bars.OrderBy(m => m.Field<string>("foo"))
                            }
              };

Hardware Specs:

Dell Latitude with a 2.20GHz dual-core processor and 4GB RAM

+1  A: 

The problem with groups and orders is that they require knowledge of the entire collection in order to perform the operation. The same is true of aggregates like Min, Max, Sum, Average, etc. Some of these operations cannot inspect the true type of the IEnumerable being passed in, and even when they could it wouldn't matter, because they are "destructive" in nature and have to create a working copy. When you chain these things together, you end up with at least two copies of the full enumerable: the one produced by the previous method that is being iterated by the current method, and the one being generated by the current method. Any copy of the enumerable that has a reference outside the query (like the source enumerable) also stays in memory, and enumerables that have become orphaned stick around until the GC thread has time to dispose of and finalize them. For a large source enumerable, all of this can create a massive demand on the heap.
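As a rough illustration (a hypothetical sketch with made-up data, not code from the question), each buffering operator in a chain like this has to materialize its input before it can produce any output:

using System.Linq;

var source = Enumerable.Range(0, 1_000_000)
                       .Select(i => new { Key = i % 10, Value = i });

// GroupBy cannot yield its first group until it has consumed the entire
// source, and OrderBy cannot yield its first element until it has seen its
// whole input, so chaining them keeps multiple working copies alive at once.
var result = source
    .GroupBy(x => x.Key)                              // buffers every element into groups
    .Select(g => g.OrderBy(x => x.Value).ToList())    // each ordering buffers its group again
    .ToList();                                        // materializes the outer result as well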

In addition, nesting aggregates in clauses can very quickly make the query expensive. Unlike a DBMS, which can engineer a "query plan", LINQ isn't quite that smart. Min(), for example, requires iteration of the whole enumerable to find the smallest value of the specified projection. When that's a criterion of a Where clause, a good DBMS will find that value once per context and then inline it in subsequent evaluations as necessary. LINQ just runs the extension method each time it is called, and when you have a condition like enumerable.Where(x=>x.Value == enumerable.Min(x2=>x2.Value)), that's an O(N^2) operation just to evaluate the filter. Add multiple levels of grouping and the Big-O can easily reach high-polynomial complexity.
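For example (a hypothetical sketch with a made-up scores list, mirroring the pattern above), the aggregate in the predicate is re-run for every element the filter inspects:

using System.Collections.Generic;
using System.Linq;

var scores = new List<int> { 5, 3, 8, 3, 9 };

// Min() walks the whole list again for every element that Where() tests,
// so filtering N elements costs roughly N * N comparisons.
var smallest = scores.Where(v => v == scores.Min()).ToList();   // yields { 3, 3 }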

Generally, you can reduce query time by performing the optimizations that a DBMS would make given the same query. If the value of an aggregate can be known for the full scope of the query (e.g. result = source.Where(s=>s.Value == source.Min(x=>x.Value))), evaluate it into a variable using the let clause (or an external query) and replace the Min() calls with the alias. Iterating the enumerable twice is generally cheaper than iterating it N^2 times, especially if the enumerable stays in memory between iterations.
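A minimal sketch of that rewrite, assuming the same kind of made-up scores list as in the previous snippet:

using System.Collections.Generic;
using System.Linq;

var scores = new List<int> { 5, 3, 8, 3, 9 };

// Evaluate the aggregate once, outside the query, then compare against the cached value.
var minScore = scores.Min();                                // one pass over the data
var smallest = scores.Where(v => v == minScore).ToList();   // one more pass, O(N) overall

// Inside a grouped query, let can cache a per-group aggregate so it is
// computed once per group instead of once per element of that group.
var perGroup = from g in scores.GroupBy(v => v % 2)
               let groupMin = g.Min()
               select new { g.Key, Smallest = g.Where(v => v == groupMin) };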

Also, make sure your query order reduces the sample space as much and as cheaply as possible before you start grouping. You may be able to make educated guesses about conditions that must be evaluated expensively, such as Where(s=>s.Value < threshold).Where(s=>s.Value == source.Min(x=>x.Value)) or, more concisely, Where(s=>s.Value < threshold && s.Value == source.Min(x=>x.Value)) (the second works in C# because && short-circuits, but not all languages short-circuit their boolean operators). That cuts the number of evaluations of Min() down to the number of elements meeting the first criterion. You can reorder existing criteria to the same effect, wherever criteria A and B are independent enough that A && B == B && A.
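A short sketch of that ordering, again with hypothetical data: the cheap threshold test runs first, and because && short-circuits, the expensive Min() comparison only runs for elements that survive it:

using System.Collections.Generic;
using System.Linq;

var scores = new List<int> { 5, 3, 8, 3, 9 };
var threshold = 6;

// The cheap comparison eliminates most elements before the expensive
// Min() call is ever evaluated for them.
var result = scores.Where(v => v < threshold && v == scores.Min()).ToList();   // yields { 3, 3 }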

KeithS
This happens at design time as well as at runtime. I understand how this can affect runtime performance, but not design time.
Nathan
At design time, there's a lot going on behind the scenes as you type. The "check-as-you-go" features of VS/ReSharper perform various analyses of the code; for instance, ReSharper checks whether any conditions are always true or always false. Some of these require execution-path analysis, whose running time is at least somewhat tied to the complexity of the algorithm itself.
KeithS
Ah ok, that makes sense now. Thanks for the lesson!
Nathan