views: 393
answers: 5

I recently benchmarked the .NET 4 garbage collector, allocating intensively from several threads. When the allocated values were recorded in an array, I observed no scalability just as I had expected (because the system contends for synchronized access to a shared old generation). However, when the allocated values were immediately discarded, I was horrified to observe no scalability then either!

I had expected the temporary case to scale almost linearly because each thread should simply wipe the nursery gen0 clean and start again without contending for any shared resources (nothing surviving to older generations and no L2 cache misses because gen0 easily fits in L1 cache).
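The original benchmark code isn't shown, but its shape can be sketched as follows (a hedged reconstruction, written in Java rather than .NET so the same generational mechanics apply; all names and sizes are invented for illustration). The "retained" mode keeps every allocation alive in a shared list, forcing survival into older generations; the "discarded" mode drops each allocation immediately, so everything should die in the nursery:

```java
import java.util.ArrayList;
import java.util.List;

public class AllocBench {
    // Allocate `count` small arrays on each of `threads` threads.
    // retain == true: survivors go into a shared list (old-generation pressure).
    // retain == false: each allocation is immediately discarded (nursery-only).
    static long run(int threads, int count, boolean retain) {
        List<int[]> survivors = retain ? new ArrayList<>() : null;
        Thread[] ts = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < count; i++) {
                    int[] v = new int[8];          // small, short-lived allocation
                    v[0] = i;
                    if (survivors != null) {
                        synchronized (survivors) { survivors.add(v); }
                    }
                }
            });
            ts[t].start();
        }
        try {
            for (Thread th : ts) th.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long discarded = run(4, 1_000_000, false);
        long retained  = run(4, 100_000, true);
        System.out.println("discarded: " + discarded + " ns, retained: " + retained + " ns");
    }
}
```

The expectation was that the discarded case would scale with the thread count while the retained case would not; the surprise was that neither did.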

For example, this MSDN article says:

Synchronization-free Allocations: On a multiprocessor system, generation 0 of the managed heap is split into multiple memory arenas, using one arena per thread. This allows multiple threads to make allocations simultaneously, so that exclusive access to the heap is not required.

Can anyone verify my findings and/or explain this discrepancy between my predictions and observations?

+2  A: 

or explain this discrepancy between my predictions and observations?

Benchmarking is hard.
Benchmarking a subsystem that is not under your full control is even harder.

Henk Holterman
+8  A: 

Not sure exactly what you saw on your machine, but there are two distinct versions of the CLR on it: mscorwks.dll and mscorsvr.dll. The former is the one you get when you run your program on a workstation, the latter on one of the server versions of Windows (like Windows Server 2003 or 2008).

The workstation version is kind to your local PC: it doesn't gobble all machine resources, so you can still read your email while a GC is going on. The server version is optimized to scale on server-level hardware: lots of RAM (the GC doesn't kick in as quickly) and lots of CPU cores (garbage gets collected on more than one core). The article you quoted is probably talking about the server version.

You can select the server version on your workstation by adding the <gcServer> element to your application's .config file.
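For example (a minimal sketch, assuming the standard .NET Framework app.config schema), opting in to the server GC looks like this:

```
<!-- app.config: opt in to the server GC on a workstation -->
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```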

Hans Passant
A: 

I can hazard a couple of guesses as to what is happening.

(1) If you have a single thread and there are M bytes of space free in generation 0, then the GC will only run once M bytes have been allocated.

(2) If you have N threads and the GC divides generation 0 into N arenas of M/N bytes each, the GC will end up running every time a thread allocates M/N bytes. The showstopper here is that the GC needs to "stop the world" (i.e., suspend all running threads) in order to mark references from the threads' root sets. That is not cheap. So, not only will the GC run more often, it will also be doing more work on each collection.
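Guess (2) can be given rough numbers (the nursery and allocation sizes below are made up purely for illustration): splitting a fixed-size nursery into per-thread arenas multiplies the collection count for the same per-thread allocation volume.

```java
public class GcFrequency {
    // With a nursery of `nurseryBytes` split evenly into one arena per thread,
    // count how many collections are triggered when every thread allocates
    // `bytesPerThread` bytes. Each arena fill-up triggers one collection.
    static long collections(long nurseryBytes, int threads, long bytesPerThread) {
        long arena = nurseryBytes / threads;          // per-thread arena size
        return (long) threads * (bytesPerThread / arena);
    }

    public static void main(String[] args) {
        // 16 MB nursery, 256 MB allocated per thread:
        long single = collections(16 << 20, 1, 256L << 20); // 1 thread: 16 collections
        long multi  = collections(16 << 20, 8, 256L << 20); // 8 threads: 1024 collections
        System.out.println(single + " vs " + multi);
    }
}
```

And since each of those collections suspends every thread, the cost compounds with the thread count.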

The other problem, of course, is that multi-threaded applications aren't typically very cache friendly, which can also put a significant dent in your performance.

I don't think this is a .NET GC issue, rather it's an issue with GC in general. A colleague once ran a simple "ping pong" benchmark sending simple integer messages between two threads using SOAP. The benchmark ran twice as fast when the two threads were in separate processes because memory allocation and management was completely decoupled!

Rafe
@Rafe: "the GC needs to stop the world". Are you sure? I can imagine designs where all of the roots to objects in the nursery generation are found in the global variables, the (single) thread-local stack, and a remembered set generated by the write barrier.
Jon Harrop
@Jon: it's a little late, so I may be off base here, but wouldn't requiring each machine register to be backed by a global or stack slot rather negate a lot of code generation optimisations? Also, write barriers don't come cheap. What I have in mind is this: the GC has no way of knowing that thread i hasn't communicated a reference to a local object to thread j, so it needs to check j's roots to look for references into i's nursery. Either way, my reading of the linked article under "Performance for Multithreaded Applications" was that the .NET GC is a stop-the-world type.
Rafe
@Rafe: "wouldn't requiring each machine register to be backed by a global or stack slot rather negate a lot of code generation optimisations". No, I used the technique in HLVM and it generates very fast code.
Jon Harrop
@Rafe: "the GC has no way of knowing that thread i hasn't communicated a reference to a local object to thread j, so it needs to check j's roots to look for references into i's nursery". Many GC designs prevent pointers from shared data back to a thread-local nursery. It depends entirely upon their design. For example, this is usually accomplished by collecting the nursery generation when a pointer into it is written into the global heap.
Jon Harrop
@Rafe: "the .NET GC is a stop-the-world type". To some extent, yes. I believe .NET stops the world for a very short time in order to get a consistent snapshot of the global roots. However, it does not stop the world for entire collections. In particular, threads that do not allocate heavily are not exposed to the latency incurred by other threads allocating heavily.
Jon Harrop
@Jon: backing every machine register with a memory location does cost you quite a bit. The whole point of register colouring (which is a major optimisation) is to minimise the need to spill registers to the stack. The benefits of register colouring are noticeably reduced if you have to record in memory every change to a register containing a possible pointer value.
Rafe
@Jon: the write barrier techniques I think you're talking about are expensive. Without write barriers, my compiler can implement an assignment operation in a single machine instruction. With write barriers, almost every assignment has to be checked to see whether the target of the assignment is "local". This involves range checks and conditional branches, which are much more code than a single store instruction. The reason write barriers are reasonable in functional languages (where they were developed, IIRC), is because assignments are rare in FPLs.
Rafe
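The barrier cost Rafe describes can be made concrete with a toy model (the address ranges, names, and layout below are invented for illustration; no real VM works exactly this way). Without a barrier, a reference store is a single plain write; with a generational write barrier, every store also pays a range check and, sometimes, a remembered-set insertion:

```java
import java.util.HashSet;
import java.util.Set;

public class WriteBarrier {
    // Toy heap model: addresses below NURSERY_END are "young", the rest "old".
    static final long NURSERY_END = 1 << 20;
    static final Set<Long> rememberedSet = new HashSet<>();

    // Plain store: what a single machine instruction does.
    static void storeUnbarriered(long[] heap, int slot, long value) {
        heap[slot] = value;
    }

    // Barriered store: range checks plus a conditional remembered-set update,
    // so the GC can later find old->young pointers without scanning the old gen.
    static void storeBarriered(long[] heap, int slot, long slotAddr, long value) {
        if (slotAddr >= NURSERY_END && value < NURSERY_END) {
            rememberedSet.add(slotAddr);   // record the cross-generation pointer
        }
        heap[slot] = value;
    }

    public static void main(String[] args) {
        long[] heap = new long[4];
        storeBarriered(heap, 0, 2 << 20, 42);   // old slot <- young pointer: recorded
        storeBarriered(heap, 1, 100, 42);       // young slot: no bookkeeping
        System.out.println("remembered slots: " + rememberedSet.size());
    }
}
```

The extra compare-and-branch on every store is the cost Rafe objects to; the payoff Jon points at is that the collector can then trace the nursery from the remembered set instead of stopping and scanning everything.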
@Jon: sure, the GC only has to stop the world to find the roots in all the threads in the program, but both stopping the world and tracing all root sets are relatively expensive when you consider what actually has to happen. I think we may be talking at cross purposes here. Within the space of languages implemented on a VM, .NET is very good. However, these languages tend to compare poorly against "leaner" languages (e.g., C, C++, Mercury, OCaml). I think that is partly because the trade-offs in a VM design sacrifice some performance in the name of portability and flexibility.
Rafe
@Rafe: "I think that is partly because the trade offs in a VM design sacrifice some performance in the name of portability and flexibility". Yes. For one, .NET conveys complete run-time type information.
Jon Harrop
@Rafe: "Without write barriers, my compiler can implement an assignment operation in a single machine instruction". Sure but that makes it so much harder to implement an efficient GC that, in practice, write barriers are ubiquitous.
Jon Harrop
@Rafe: "The benefits of register colouring are noticeably reduced if you have to record in memory every change to a register containing a possible pointer value". You don't have to. You only need to make sure the in-memory representation is accurate at safe points, not all the time.
Jon Harrop
+3  A: 

Not a complete answer to the question, but just to clear up some misconceptions: the .NET GC is only concurrent in workstation mode. In server mode, it uses stop-the-world parallel GC. More details here. The separate nurseries in .NET are primarily to avoid synchronisation on allocation; they are nevertheless part of the global heap and cannot be collected separately.

Simon Marlow
@Simon: "they are nevertheless part of the global heap and cannot be collected separately". This is exactly what I needed to know. Thank you!
Jon Harrop
+1  A: 

Very quick, easy-to-see (straight at a root, assigning nulls) and massive releases can trick the GC into being eager, and the whole idea of a cache-local heap is a nice dream :-) Even if you had fully separated thread-local heaps (which you don't), the handle-pointer table would still have to be fully volatile just to make it safe for general multi-CPU scenarios. Oh, and remember that there are many threads, the CPU cache is shared, and kernel needs take precedence, so it's not all just for you :-)

Also beware that a "heap" with double pointers has two parts: the block of memory to hand out and the handle-pointer table (so that blocks can be moved but your code always sees one address). Such a table is a critical but very lean process-level resource, and just about the only way to stress it is to flood it with massive quick releases - which is what you managed to do :-))

In general the rule of GC is: leak :-) Not forever, of course, but kind of for as long as you can. Remember how people go around saying "don't force GC collections"? That's part of the story. Also, "stop the world" collection is actually much more efficient than "concurrent" collection, and used to be known by the nicer name of cycle stealing or scheduler cooperation. Only the mark phase needs to freeze the scheduler, and on a server there's a burst of several threads doing it (N cores are idle anyway :-) The only reason for the concurrent kind is that stop-the-world can make real-time operations like playing videos jittery, just as a longer thread quantum does.

So again if you go competing with infrastructure on short and frequent CPU bursts (small alloc, almost no work, quick release) the only thing you'll see/measure will be the GC and JIT noise.

If this was for something real, i.e. not just experimenting, the best you can do is to use big value arrays on the stack (structs). They can't be forced onto the heap, they are as local as a local can get, and they are not subject to any backdoor moving => the cache has to love them :-) That may mean switching to "unsafe" mode, using normal pointers and maybe doing a bit of allocation on your own (if you need something simple like lists), but that's a small price to pay for kicking the GC out :-) Trying to force data into the cache also depends on keeping your stacks lean otherwise - remember that you are not alone. Also, giving your threads some work that's worth at least several quantums between releases may help. The worst-case scenario would be allocating and releasing within a single quantum.

ZXX