I am using the Task Parallel Library from .NET Framework 4 (specifically Parallel.For and Parallel.ForEach). However, I am getting extremely mediocre speed-ups when parallelizing some tasks that look like they should be easy to parallelize on a dual-core machine.

In profiling the system, it looks like there is a lot of thread synchronization going on because of the garbage collector. I am doing a lot of allocation of objects, so I am wondering how I can improve the concurrency while minimizing a rewrite of my code.

For example, would any of these techniques be useful in this situation?

  • Should I try to manage the GC manually?
  • Should I be using Dispose?
  • Should I be pinning objects?
  • Should I be doing other unsafe code tricks?

POSTSCRIPT:

The problem is not that the GC runs too often; it is that the GC prevents concurrent code from running in parallel efficiently. I also don't consider "allocate fewer objects" to be an acceptable answer. That requires rewriting too much code to work around a poorly parallelized garbage collector.

I already found one trick which helped overall performance (using gcServer), but it didn't help the concurrent performance. In other words, Parallel.For was only 20% faster than a serial for loop on an embarrassingly parallel task.

POST-POSTSCRIPT:

Okay, let me explain further. I have a rather big and complex program: an optimizing interpreter. It is fast enough, but I want its performance when given parallel tasks (primitive operations built into my language) to scale well as more cores become available. I allocate lots of small objects during evaluation. The whole interpreter design is based on all values being derived from a single polymorphic base object. This works great in a single-threaded application, but when we try to apply the Task Parallel Library to parallel evaluations there is no advantage.

After a lot of investigation into why the Task Parallel Library was not properly distributing work across cores for these tasks, it seems the culprit is the GC. The GC appears to act as a bottleneck because it does some behind-the-scenes thread synchronization that I don't understand.

What I need to know is what exactly the GC is doing that can cause heavily concurrent code to perform badly when it does lots of allocations, and how we can work around that other than by just allocating fewer objects. That approach has already occurred to me, and it would require a significant rewrite of a lot of code.

A: 

1) You can't and shouldn't manage the GC manually.

2) Dispose is only an indication to the GC; it will run whenever it sees fit anyway. :P

The only way to avoid these problems is to profile your app and avoid allocating new objects as much as possible. Once you've found what is going into the garbage collector, try some pooling technique to reuse that data instead of recreating it every time.

EDIT: Whenever the GC runs, ALL threads must be suspended so it can do its work. That's the reason for the slowdown when there are as many collections as in your case. There is no way to manage this other than to reduce the number of new objects created.

feal87
Dispose frees the resources associated with an object. The garbage collector removes the object from memory. Dispose runs when it is called; a finalizer, if there is one, runs before the object is destroyed, when the garbage collector decides it is time to remove the object.
Obalix
Your solution seems to be effectively "don't allocate so many objects", right? Can you convince me why this is the best bet here, or provide more information? For example, why is the garbage collector so darn lame at dealing with many objects and heavily concurrent code? If you can expand your answer I'll upvote.
cdiggins
Added more info, but really there isn't any other way. :P
feal87
@cdiggins: perhaps you'd care to implement a low-overhead concurrent garbage collector? ;) It's so "darn lame" because you're asking it to do the impossible. The more objects you allocate, the more often the GC has to run. And while yes, concurrent GCs exist, they are generally much less efficient. .NET has aimed for an efficient GC, at the cost of losing the ability to run it concurrently. It's really common sense: if the GC takes too much time, give it less work to do.
jalf
+4  A: 

If GC is running too often due to too many objects being allocated/GC-ed, try to allocate fewer of them :)

Depending on your scenario, try to reuse existing objects, create an object pool, or use "lighter" objects that do not put so much pressure on memory (or larger ones, to reduce the number of objects allocated).
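As a rough illustration of the pooling idea, here is a minimal sketch. SimplePool, PoolDemo, and the 256-element buffer are made-up names and sizes for this example, not part of .NET or of the asker's code; ConcurrentBag<T> is the .NET 4 collection used to hold the recycled instances.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical minimal pool: recycles instances instead of allocating new ones.
public sealed class SimplePool<T> where T : class
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();
    private readonly Func<T> _factory;

    public SimplePool(Func<T> factory)
    {
        _factory = factory;
    }

    // Reuse a pooled instance if one is available; otherwise allocate a new one.
    public T Rent()
    {
        T item;
        return _items.TryTake(out item) ? item : _factory();
    }

    // Give the instance back. The caller must not touch it afterwards.
    public void Return(T item)
    {
        _items.Add(item);
    }
}

public static class PoolDemo
{
    // Runs 'iterations' parallel steps and returns how many buffers were really allocated.
    public static int Run(int iterations)
    {
        int allocated = 0;
        var pool = new SimplePool<double[]>(() =>
        {
            Interlocked.Increment(ref allocated);
            return new double[256];
        });

        Parallel.For(0, iterations, i =>
        {
            double[] buffer = pool.Rent();   // instead of: new double[256]
            buffer[0] = i;                   // stand-in for real work
            pool.Return(buffer);
        });

        return allocated;
    }

    public static void Main()
    {
        Console.WriteLine("Buffers allocated for 100000 iterations: " + Run(100000));
    }
}
```

Rented objects come back with whatever state the previous user left in them, so real code has to reset them. Whether a pool is a net win depends on how expensive the pooled object is compared with the pool's own bookkeeping, so it is worth measuring before and after.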

Do not try to "manage GC" by calling GC.Collect explicitly; it very rarely pays off (Rico Mariani says so):

http://blogs.msdn.com/ricom/archive/2003/12/02/40780.aspx

Marek
+1 for 'Rico says so' :)
Steven
The problem is not that the GC runs too often; it is that the GC prevents concurrent code from running in parallel efficiently.
cdiggins
I do not know your exact scenario, but do you think that if the GC ran every 10 seconds instead of every 10 ms, it would still prevent the concurrent code from running in parallel efficiently?
Marek
I don't know, Marek. I am simply at a loss to understand how the GC is apparently triggering so many synchronization events and preventing my application from leveraging multiple cores effectively.
cdiggins
+1  A: 

This is a fact of life. Almost all memory management schemes serialize code that looks embarrassingly parallel to some degree. I think C# has thread-local allocators, so it should only be serializing on collections. Nonetheless, I'd recommend pooling/reusing your most frequently allocated objects and arrays, and maybe converting some small, non-polymorphic objects to structs to see if that helps.

dsimcha
+1  A: 

For your four points:

  1. See http://stackoverflow.com/questions/2311154/how-can-i-improve-garbage-collector-performance-of-net-4-0-in-highly-concurrent/2311171#2311171 (1)
  2. You should dispose if your objects hold resources, especially unmanaged resources. Dispose is executed immediately when it is called. A finalizer, if there is one (roughly a destructor in C++), is only called when the GC runs and is about to remove the object from memory.
  3. Pinning objects only makes sense if the object is passed to unmanaged code, e.g. an unmanaged C++ DLL. Otherwise, leave the garbage collector to do its share in keeping the memory tidy. Pinning can also lead to memory fragmentation.
  4. Not if you don't have to.
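To make the Dispose point concrete, here is a small sketch. NativeHandleWrapper and DisposeDemo are hypothetical names, and the "resource" is only simulated by a log entry; the sketch shows that Dispose runs deterministically at the end of a using block, independently of when the GC later reclaims the object's memory.

```csharp
using System;
using System.Collections.Generic;

public static class DisposeDemo
{
    public static readonly List<string> Log = new List<string>();

    // Hypothetical wrapper around some unmanaged resource.
    private sealed class NativeHandleWrapper : IDisposable
    {
        private bool _disposed;

        public void Dispose()
        {
            if (_disposed) return;
            _disposed = true;
            Log.Add("disposed");          // real code would release the resource here
            GC.SuppressFinalize(this);    // the finalizer is no longer needed
        }

        // Safety net: only runs if Dispose was never called, at a time the GC chooses.
        ~NativeHandleWrapper()
        {
            // real code would release the unmanaged resource here
        }
    }

    public static void Run()
    {
        Log.Clear();
        using (var handle = new NativeHandleWrapper())
        {
            Log.Add("in use");
        }                                 // Dispose is called here, immediately
        Log.Add("after using");
    }

    public static void Main()
    {
        Run();
        Console.WriteLine(string.Join(", ", Log));
    }
}
```

Note that disposing does nothing to reduce the number of managed objects the GC has to collect, so it is unlikely to change the scaling problem described in the question.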

One thing to think about is moving the allocation out of your loops, if that is possible. In many cases where you can do this, it also allows you to reuse already allocated objects, which gives additional performance (at least that's what my experience shows). See also http://stackoverflow.com/questions/2311154/how-can-i-improve-garbage-collector-performance-of-net-4-0-in-highly-concurrent/2311215#2311215.

The degree of parallelism you can reach always depends on the task you are doing. For pure computation, the maximum achievable speed-up is less than n, where n is the number of processors. For tasks dominated by input or output operations, the speed-up will usually exceed n.
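One way to put a number on this, assuming Amdahl's law is a fair model for the situation (the law is not mentioned elsewhere in this thread, and GC pauses are only approximately a fixed serial fraction): with a fraction p of the runtime able to run in parallel on n cores, the speed-up is S(n). Plugging in the figures from the question (n = 2, speed-up of about 1.2) gives an estimate of p.

```latex
\begin{align*}
S(n) &= \frac{1}{(1 - p) + \dfrac{p}{n}} \\[4pt]
1.2 &= \frac{1}{(1 - p) + \dfrac{p}{2}}
  \;\Longrightarrow\; 1 - \frac{p}{2} = \frac{1}{1.2} = \frac{5}{6}
  \;\Longrightarrow\; p = \frac{1}{3}
\end{align*}
```

Under that assumption, only about a third of the measured runtime is actually executing in parallel, which fits the observation that something is serializing the work.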

Obalix
A: 

In profiling the system, it looks like there is a lot of thread synchronization going on because of the garbage collector. I am doing a lot of allocation of objects, so I am wondering how I can improve the concurrency while minimizing a rewrite of my code.

Don't do a lot of allocation of objects. The only universal way to speed up your code is to make it do less work. If the GC takes too much time, there are two theoretical options:

  • Implement a better GC, or
  • Give the GC less work to do

The first point is pretty much impossible. It'd take a lot of hacking to replace the .NET GC in the first place, and it'd take a lot of work to design a GC that's even remotely as efficient as the .NET one.

The second point is really your only option: If a garbage collection requires synchronization, make sure that fewer collections take place. They generally occur when the gen0 heap is too full to satisfy an allocation request.

So make sure that doesn't happen. Don't allocate so many objects. You have several ways to avoid it:

  1. Using structs instead of classes may help reduce GC pressure, since a struct used as a local is typically stack-allocated rather than becoming a separate heap object. Small, short-lived objects in particular would probably benefit from being converted to structs.
  2. Reuse the objects you allocate. Longer-lived objects are promoted to the older generations, where collections take place far less often. Move allocations out of loops, for example.
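A quick way to check whether such a change reduces GC work is to count gen0 collections around a piece of code with GC.CollectionCount(0). The sketch below does that for a class-based and a struct-based version of the same loop; PointClass, PointStruct, and StructVsClass are hypothetical names for this example, and the actual counts depend on the runtime, GC mode, and machine.

```csharp
using System;

public static class StructVsClass
{
    // Reference type: every instance is a separate heap object the GC has to track.
    private sealed class PointClass
    {
        public double X, Y;
        public PointClass(double x, double y) { X = x; Y = y; }
    }

    // Value type: a local of this type needs no heap allocation of its own.
    private struct PointStruct
    {
        public double X, Y;
        public PointStruct(double x, double y) { X = x; Y = y; }
    }

    public static double SumWithClasses(int n)
    {
        double sum = 0;
        for (int i = 0; i < n; i++)
        {
            var p = new PointClass(i, 1);    // one heap allocation per iteration
            sum += p.X * p.Y;
        }
        return sum;
    }

    public static double SumWithStructs(int n)
    {
        double sum = 0;
        for (int i = 0; i < n; i++)
        {
            var p = new PointStruct(i, 1);   // no heap allocation
            sum += p.X * p.Y;
        }
        return sum;
    }

    // Number of gen0 collections that happen while 'work' runs.
    public static int Gen0CollectionsDuring(Action work)
    {
        int before = GC.CollectionCount(0);
        work();
        return GC.CollectionCount(0) - before;
    }

    public static void Main()
    {
        const int n = 10000000;
        Console.WriteLine("classes: " + Gen0CollectionsDuring(() => SumWithClasses(n)) + " gen0 collections");
        Console.WriteLine("structs: " + Gen0CollectionsDuring(() => SumWithStructs(n)) + " gen0 collections");
    }
}
```

The same measuring helper can be wrapped around the real parallel loop to see how many collections it triggers before and after a change. Structs cannot take part in an inheritance hierarchy, so this only applies to values that do not need to derive from the polymorphic base class.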
jalf
A: 

Parallel tasks and even raw threading are not magic bullets to make your code go faster. If you have any locks or shared resources, or only have a few cores, you can slow code down by trying to be multi-threaded. You also need to make sure you are not incurring excessive context switches, and hopefully you have more than 4 cores. (Don't forget that the GC, the CLR, Windows, and other applications and services are all contending for resources/cycles.)

You should also know that pinning and unsafe code can slow down some operations. They require special handling from both the CLR and the GC to make sure that memory and resources are kept safe (for example, the GC can't compact as well if you pin objects or use unsafe code).

The Task Parallel Library has been created for general-purpose use. If you need highly optimized code you may need to manage your own threads as well. (Contrary to what many blogs say, there are no magic bullets in this profession.)

Your best bet will be to create one instance of your worker class per thread to avoid construction and destruction per action. Check out ThreadStaticAttribute. It is my understanding that there are other options in .NET 4.0, but I have not had a chance to work with them yet.
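One of the .NET 4 options is the Parallel.For overload that takes localInit and localFinally delegates; ThreadLocal<T> is another. The sketch below uses the former to create one worker per task started by the loop instead of one per iteration. Worker, PerTaskWorkerDemo, and the 64-element scratch buffer are hypothetical stand-ins for the real worker class.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PerTaskWorkerDemo
{
    // Hypothetical worker: owns a scratch buffer that is reused across iterations.
    private sealed class Worker
    {
        private readonly double[] _scratch = new double[64];
        public double Total;   // partial result accumulated by this worker

        public double Evaluate(int input)
        {
            double sum = 0;
            for (int i = 0; i < _scratch.Length; i++)
            {
                _scratch[i] = input + i;   // stand-in for real work
                sum += _scratch[i];
            }
            return sum;
        }
    }

    public static int WorkersCreated;

    public static double Run(int count)
    {
        WorkersCreated = 0;
        double grandTotal = 0;
        object gate = new object();

        Parallel.For(0, count,
            // localInit: runs once per task the loop starts, not once per iteration.
            () =>
            {
                Interlocked.Increment(ref WorkersCreated);
                return new Worker();
            },
            // body: reuses the worker handed in and passes it on to the next iteration.
            (i, loopState, worker) =>
            {
                worker.Total += worker.Evaluate(i);
                return worker;
            },
            // localFinally: merge each worker's partial result under a lock.
            worker =>
            {
                lock (gate) { grandTotal += worker.Total; }
            });

        return grandTotal;
    }

    public static void Main()
    {
        Console.WriteLine("total = " + Run(1000) + ", workers created = " + WorkersCreated);
    }
}
```

The lock is only taken once per worker, in localFinally, so the loop body itself stays free of both synchronization and allocation.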

Matthew Whited
+1  A: 

I have an idea -- why not try an alternate GC implementation? .NET provides three.

http://blogs.msdn.com/maoni/archive/2004/09/25/234273.aspx

Based on your problem description, I'd be curious to see how the server GC works out for you, since it provides a separate heap per core. It's probably also worth looking into the Background GC mode that .NET 4 adds.

http://blogs.msdn.com/maoni/archive/2008/11/19/so-what-s-new-in-the-clr-4-0-gc.aspx
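For reference, the GC mode is selected in the application's configuration file (app.config, which becomes YourApp.exe.config; the file name here is a placeholder). A minimal fragment, assuming a standard .NET Framework application, looks like this:

```xml
<configuration>
  <runtime>
    <!-- Server GC: one heap and one GC thread per logical core. -->
    <gcServer enabled="true"/>
    <!-- Concurrent/background collection; this is enabled by default. -->
    <gcConcurrent enabled="true"/>
  </runtime>
</configuration>
```

Which combinations actually take effect depends on the framework version and on the machine, so it is worth confirming at runtime, for example by checking System.Runtime.GCSettings.IsServerGC.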

Hopefully that's a little more helpful to your specific case than the answers so far.

Promit