Across several languages (mostly D and Java/Jython), I've noticed that parallel programs with no obvious synchronization bottleneck often fail to scale well to four or more cores, and the cause appears to be memory management. I'm aware that thread-local allocators mitigate this, but most garbage collector implementations still need to stop the world. Garbage collection is not embarrassingly parallel (shared state has to be updated too frequently), so a parallel collector doesn't completely solve the problem either. With manual memory management, even if most allocations come from a thread-local allocator, the memory still has to be freed, possibly from a different thread than the one that allocated it.
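To make the kind of workload concrete, here is a minimal Java sketch. It is an illustration only, not one of the programs I actually measured: the class name `AllocScaling`, the array size, and the iteration counts are arbitrary assumptions. Each task allocates many short-lived arrays and shares no state with the other tasks, so any failure to scale would have to come from the allocator or the collector. Note that the JIT may optimize some of these allocations away, so treat the timings as a rough indication at best.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AllocScaling {
    // Allocates one small, short-lived array per iteration.
    // No locks and no shared mutable state are involved.
    static long allocateLoop(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            int[] scratch = new int[16];   // becomes garbage immediately
            scratch[i & 15] = 1;
            total += scratch[i & 15];
        }
        return total;                      // equals the iteration count
    }

    // Runs the same allocation loop on `threads` worker threads
    // and returns the sum of the per-thread results.
    static long run(int threads, int iterationsPerThread) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<Long> task = () -> allocateLoop(iterationsPerThread);
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(task));
            }
            long sum = 0;
            for (Future<Long> f : results) {
                sum += f.get();
            }
            return sum;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int perThread = 20_000_000;        // arbitrary workload size
        for (int threads : new int[] {1, 2, 4, 8}) {
            long start = System.nanoTime();
            long sum = run(threads, perThread);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " threads: " + ms + " ms (checksum " + sum + ")");
        }
    }
}
```

Since every thread does the same fixed amount of work, ideal scaling would keep the wall-clock time roughly constant as the thread count grows up to the number of cores. A time that climbs with the thread count is the symptom I'm describing.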
Is there any language/runtime/malloc implementation for which the memory management bottleneck to SMP parallelism is for all practical purposes a solved problem, while still allowing traditional shared address space multithreading?