I have a program with many independent computations so I decided to parallelize it.
I use Parallel.For/Each.
The results were okay for a dual-core machine - CPU utilization of about 80%-90% most of the time. However, with a dual Xeon machine (i.e. 8 cores) I get only about 30%-40% CPU utilization, although the program spends quite a lot of time (sometimes more than 10 seconds) on the parallel sections, and I see it employs about 20-30 more threads in those sections compared to serial sections. Each thread takes more than 1 second to complete, so I see no reason for them to not work in parallel - unless there is a synchronization problem.
I used the built-in profiler of VS2010, and the results are strange. Even though I use locks only in one place, the profiler reports that about 85% of the program's time is spent on synchronization (also 5-7% sleep, 5-7% execution, under 1% IO).
The locked code is only a cache (a dictionary) get/add:
bool esn_found;
lock (lock_load_esn)
esn_found = cache.TryGetValue(st, out esn);
if(!esn_found)
{
esn = pData.esa_inv_idx.esa[term_idx];
esn.populate(pData.esa_inv_idx.datafile);
lock (lock_load_esn)
{
if (!cache.ContainsKey(st))
cache.Add(st, esn);
}
}
lock_load_esn
is a static member of the class of type Object.
esn.populate
reads from a file using a separate StreamReader for each thread.
However, when I press the Synchronization button to see what causes the most delay, I see that the profiler reports lines which are function entrance lines, and doesn't report the locked sections themselves.
It doesn't even report the function that contains the above code (reminder - the only lock in the program) as part of the blocking profile with noise level 2%. With noise level at 0% it reports all the functions of the program, which I don't understand why they count as blocking synchronizations.
So my question is - what is going on here?
How can it be that 85% of the time is spent on synchronization?
How do I find out what really is the problem with the parallel sections of my program?
Thanks.
Update: After drilling down into the threads (using the extremely useful visualizer) I found out that most of the synchronization time was spent on waiting for the GC thread to complete memory allocations, and that frequent allocations were needed because of generic data structures resize operations.
I'll have to see how to initialize my data structures so that they allocate enough memory on initialization, possibly avoiding this race for the GC thread.
I'll report the results later today.
Update: It appears memory allocations were indeed the cause of the problem. When I used initial capacities for all Dictionaries and Lists in the parallel executed class, the synchronization problem were smaller. I now had only about 80% Synchronization time, with spikes of 70% CPU utilization (previous spikes were only about 40%).
I drilled even further into each thread and discovered that now many calls to GC allocate were made for allocating small objects which were not part of the large dictionaries.
I solved this issue by providing each thread with a pool of preallocated such objects, which I use instead of calling the "new" function.
So I essentially implemented a separate pool of memory for each thread, but in a very crude way, which is very time consuming and actually not very good - I still have to use a lot of new for the initialization of these objects, only now I do it once globally and there is less contention on the GC thread, even when having to increase the size of the pool.
But this is definitely not a solution I like as it is not generalized easily and I wouldn't like to write my own memory manager.
Is there a way to tell .NET to allocate a predefined amount of memory for each thread, and then take all memory allocations from the local pool?