views: 377
answers: 1

I have a program with many independent computations so I decided to parallelize it.

I use Parallel.For/Each.
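
The parallel sections are shaped roughly like this (a simplified sketch - terms and ProcessTerm are illustrative names, not my actual code):

// Parallel.ForEach lives in System.Threading.Tasks (.NET 4).
Parallel.ForEach(terms, term =>
{
    // Each iteration is an independent, CPU-bound computation.
    ProcessTerm(term);
});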

The results were okay on a dual-core machine - CPU utilization of about 80%-90% most of the time. However, on a dual Xeon machine (i.e. 8 cores) I get only about 30%-40% CPU utilization, although the program spends quite a lot of time (sometimes more than 10 seconds) in the parallel sections, and I see it employs about 20-30 more threads in those sections compared to serial sections. Each thread takes more than 1 second to complete, so I see no reason for them not to work in parallel - unless there is a synchronization problem.

I used the built-in profiler of VS2010, and the results are strange. Even though I use locks only in one place, the profiler reports that about 85% of the program's time is spent on synchronization (also 5-7% sleep, 5-7% execution, under 1% IO).

The locked code is only a cache (a dictionary) get/add:

bool esn_found;
// First check: is the value already cached?
lock (lock_load_esn)
    esn_found = cache.TryGetValue(st, out esn);
if (!esn_found)
{
    // Cache miss: load the value outside the lock so other threads aren't blocked.
    esn = pData.esa_inv_idx.esa[term_idx];
    esn.populate(pData.esa_inv_idx.datafile);
    // Second check: another thread may have added the same key in the meantime.
    lock (lock_load_esn)
    {
        if (!cache.ContainsKey(st))
            cache.Add(st, esn);
    }
}

lock_load_esn is a static Object member of the class.
esn.populate reads from a file using a separate StreamReader for each thread.

However, when I press the Synchronization button to see what causes the most delay, I see that the profiler reports lines which are function entry points, and doesn't report the locked sections themselves.
It doesn't even report the function that contains the above code (reminder - the only lock in the program) as part of the blocking profile at noise level 2%. At noise level 0% it reports all the functions of the program, and I don't understand why they would count as blocking synchronizations.

So my question is - what is going on here?
How can it be that 85% of the time is spent on synchronization?
How do I find out what really is the problem with the parallel sections of my program?

Thanks.

Update: After drilling down into the threads (using the extremely useful visualizer) I found out that most of the synchronization time was spent waiting for the GC thread to complete memory allocations, and that the frequent allocations were caused by resize operations on the generic data structures.

I'll have to see how to initialize my data structures so that they allocate enough memory on initialization, possibly avoiding this contention for the GC thread.
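
For example, something like this (the capacity and the EsnEntry type name are illustrative):

// Pre-size the collections so they don't resize (and reallocate)
// while the parallel sections are running.
cache = new Dictionary<string, EsnEntry>(100000);
results = new List<string>(100000);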

I'll report the results later today.

Update: It appears memory allocations were indeed the cause of the problem. When I used initial capacities for all the Dictionaries and Lists in the parallel-executed class, the synchronization problem was smaller. I now had about 80% synchronization time, with spikes of 70% CPU utilization (the previous spikes were only about 40%).

I drilled even further into each thread and discovered that many GC allocations were now being made for small objects that were not part of the large dictionaries.

I solved this issue by providing each thread with a pool of such preallocated objects, which I use instead of calling new.

So I essentially implemented a separate memory pool for each thread, but in a very crude way that is time-consuming and actually not very good - I still have to use a lot of new to initialize these objects, only now I do it once globally, so there is less contention on the GC thread, even when the size of the pool has to be increased.
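
In rough outline, each pool looks something like this (a simplified sketch; Node and Reset are placeholders for my actual types):

class NodePool
{
    // One stack per thread ([ThreadStatic]), so Get/Return need no locking.
    [ThreadStatic]
    private static Stack<Node> pool;

    public static Node Get()
    {
        if (pool == null)
            pool = new Stack<Node>();
        // Reuse a pooled object when possible; fall back to "new".
        return pool.Count > 0 ? pool.Pop() : new Node();
    }

    public static void Return(Node node)
    {
        if (pool == null)
            pool = new Stack<Node>();
        node.Reset(); // clear per-use state before recycling
        pool.Push(node);
    }
}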

But this is definitely not a solution I like, as it is not easily generalized, and I wouldn't like to write my own memory manager.
Is there a way to tell .NET to allocate a predefined amount of memory for each thread, and then take all memory allocations from the local pool?

+4  A: 

Can you allocate less?

I've had a couple of similar experiences, looking at bad perf and discovering the heart of the issue was the GC. In each case, though, I discovered that I was accidentally hemorrhaging memory in some inner loop, needlessly allocating tons of temporary objects. I'd give the code a careful look and see if there are allocations you can remove. I think it's rare for programs to 'need' to allocate heavily in inner loops.
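
The classic shape of the problem looks something like this (a contrived sketch; Fill and Consume are placeholders):

// Bad: allocates a fresh List<int> on every iteration of a hot loop.
for (int i = 0; i < n; i++)
{
    var temp = new List<int>();
    Fill(temp, i);
    Consume(temp);
}

// Better: allocate once, reuse across iterations.
var buffer = new List<int>();
for (int i = 0; i < n; i++)
{
    buffer.Clear();
    Fill(buffer, i);
    Consume(buffer);
}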

Brian
I believe I could allocate less. But this would mean e.g. implementing trees of complex structs as multi-dimensional arrays (instead of allocating each node separately), alongside other non-intuitive implementations of data structures. This could be done, but frankly - I don't believe this should be the general solution to memory allocation contention.
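For illustration, the kind of restructuring I mean (the node layout here is made up):

// Nodes live in one preallocated array; children are index links, not references.
struct TreeNode
{
    public double Value;
    public int Left;   // index into nodes, or -1 if none
    public int Right;  // index into nodes, or -1 if none
}

TreeNode[] nodes = new TreeNode[maxNodeCount];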
Haggai
(Hard to speculate without knowing more about the details of these algorithms/data structures.)
Brian
BTW, have you run the memory profiler? It sometimes points to gross errors - like you see a billion strings being allocated and realize 'whoops, instead of using + on strings I should be using a StringBuilder', or whatnot. I'd be curious to know exactly what data structures are being allocated and then quickly discarded inside the inner loop.
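That is, something like this (parts is a placeholder):

// Bad: each += allocates a brand-new string.
string s = "";
foreach (string part in parts)
    s += part;

// Better: one growing buffer, one final string.
var sb = new System.Text.StringBuilder();
foreach (string part in parts)
    sb.Append(part);
string result = sb.ToString();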
Brian
Profiling the memory was actually a great suggestion. It let me see what objects are being allocated the most, and allowed me to solve the issue using per-thread memory pools, dedicated to the types I allocate the most. Obviously this means that I have to do memory management by myself, which is ugly and not efficient at all, but it's better than nothing. Current CPU utilization averages 40%-50%, with spikes at 80%, but this has little effect on overall program time... I guess I'll have to fine-tune the global and local pool sizes.
Haggai
Still, I would much prefer that .NET did per-thread memory management for me. I'll take the reduction of allocations as an answer, since this is practically what I've done with the pools.
Haggai