VS 7.1 release mode does not seem to be properly parallelizing threads while debug mode does. Here is a summary of what is happening.

First, for what it's worth, here is the main piece of code that does the parallelizing, though I don't think it's the issue:

       // parallelize the search: start THREADS workers suspended so that
       // m_bAutoDelete can be cleared before each one runs, then resume them
       CWinThread* thread[THREADS];
       for ( i = 0; i < THREADS; i++ ) {
           thread[i] = AfxBeginThread( game_search, &parallel_params[i],
                                       THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED );
           thread[i]->m_bAutoDelete = FALSE;   // the CWinThread objects are deleted below
           thread[i]->ResumeThread();
       }

       // wait for every worker to finish, then clean up
       for ( i = 0; i < THREADS; i++ ) {
           WaitForSingleObject( thread[i]->m_hThread, INFINITE );
           delete thread[i];
       }

THREADS is a global variable that I set by hand and recompile whenever I want to change the number of threads. To give a bit of context, this is a game-playing program that searches game positions.

Here is what happens that doesn't make sense to me.

First, compiling in debug mode. If I set THREADS to 1 the one thread manages to search about 13,000 positions. If I set THREADS to 2, each thread searches about 13,000 positions. Great!

If I compile in release mode and set THREADS to 1, the thread manages to search about 30,000 positions, the typical speedup I'm used to seeing when moving from debug to release. But here is the kicker. When I compile with THREADS = 2, each thread only searches about 15,000 positions, obviously half of what THREADS = 1 does, so adding a second thread in a release build gives me no overall speedup whatsoever. :(

Watching task manager when these things run, with THREADS = 1 I see 50% CPU usage on my dual core machine and when THREADS = 2 I see 100% CPU usage. But the release compile seems to be giving me an effective CPU usage of 50%. Or something?!

Any thoughts? Is there something I should be setting in the Property Pages?


Update: The following is also posted below, but it was suggested I update this post. It was also suggested I post code, but it is quite a large project. I'm hoping others have run into this kind of behavior themselves in the past and can shed some light on what's going on.


I ran the program on a quad core system and got consistent but still confusing results. I know I am verging on getting away from a specific programming question and becoming a bit abstract, but I'd really like to hear any comments you might have to help explain the numbers I am seeing. For all of these tests I run for 30 seconds and according to task manager all threads are running full power for the entire 30 seconds.

When running in Debug mode, if I run with 1 thread it gets X amount of work done. If I run 2 threads each thread gets X amount of work done. Similarly with 3 and 4 threads. Scaling is perfect.

When running in Release mode, this is what happens:

With 1 thread: it gets Y amount of work done, where Y is nearly double X.

With 2 threads: Each thread gets Y amount of work done. Again, perfect scaling.

With 3 threads: 1 thread gets Y amount of work done, the other 2 threads get 2/3 Y amount of work done. I've lost about 2/3 of a CPU's worth of work even though one core is presumably completely idle. Task Manager shows 75% CPU usage.

With 4 threads: 1 thread gets Y amount of work done. The other 3 threads get 1/2 Y amount of work done. Now I've lost about 1.5 CPU's worth of computing. The Task Manager shows 100% CPU usage.

The obvious questions are:

(1) Repeating the earlier question, why does Debug mode scale so well, but not Release?

(2) Why is one core always able to get full usage but the others seem to fall off? This lack of symmetry is disturbing.

(3) Why are the others falling off? Memory bandwidth was suggested earlier, but that seems like an awfully steep price.

Any comments or insights are most welcome. And, as always, thanks!

+1  A: 

I think you should be using WaitForMultipleObjects().

jeffamaphone
That's a good point. I should probably switch. But it seems unlikely that is causing the slow behavior in Release mode but not Debug mode, don't you think?
No, as others point out Release mode does all sorts of optimizations that could be responsible for what you're seeing. Are you using /O* or /G* in your compile/link?
jeffamaphone
A: 

The problem with multi-threading is that it is non-deterministic.

First of all, the DEBUG target doesn't optimize the code. It also adds additional code for runtime checks (e.g. asserts, traces in MFC, etc.).
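
For example, MFC's ASSERT and TRACE macros turn into real code (checks, debugger output) in a DEBUG build and compile away completely in RELEASE; the names below are only illustrative:

ASSERT( pPosition != NULL );                              // code + possible assertion dialog in DEBUG, nothing in RELEASE
TRACE( "searching node %d at depth %d\n", nNodes, depth ); // debugger output in DEBUG, nothing in RELEASE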

The RELEASE target is optimized, so the generated binary can differ quite a bit from the DEBUG one.

What is the job executed by the thread is also important. For example, if your threads are using some IO operations, they will have some idle times, waiting for those IO operations to complete. Since in RELEASE mode the code to be executed is expected to be more efficient, the ratio between idle time and execution time might be different than in DEBUG mode.

I am only guessing possible explanations, given the provided information.

Later update: You can use WaitForMultipleObjects to wait for all the threads to finish:

DWORD result = WaitForMultipleObjects(
    numberOfThreads,     // number of thread handles in the array (at most MAXIMUM_WAIT_OBJECTS)
    threadHandleArray,   // the array of thread handles
    TRUE,                // TRUE means wait for all the threads to finish
    INFINITE );          // wait indefinitely
if ( result == WAIT_FAILED )
{
    // some error handling here
}
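
Applied to the code in the question (names taken from that snippet; just a sketch), the two wait/delete loops could become:

HANDLE handles[THREADS];   // note: THREADS must not exceed MAXIMUM_WAIT_OBJECTS (64)
for ( i = 0; i < THREADS; i++ )
    handles[i] = thread[i]->m_hThread;

// a single wait for all workers instead of one WaitForSingleObject per thread
if ( WaitForMultipleObjects( THREADS, handles, TRUE, INFINITE ) == WAIT_FAILED )
{
    // error handling, e.g. log GetLastError()
}

for ( i = 0; i < THREADS; i++ )
    delete thread[i];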
Cătălin Pitiș
A: 

I'm not sure I understand why there are a different number of positions searched in Debug vs. Release. You are waiting for the threads to complete, so I would just expect the Release version to finish faster but for both versions to generate the same results.

Are you imposing a per-thread time limit? If so what is the mechanism for this?

In the absence of logic bugs, it would appear that your processing is CPU-bound in the Debug case, in both the single- and double-threaded versions. In the Release case you are not getting any effective speedup, which means that either the processing is now more efficient and limited by something else (e.g. IO or memory bandwidth), or any gains you are making are offset by frequent context switching between the threads, which might happen if you have a poor synchronization strategy.

It would be helpful to know exactly what processing each thread does, what shared data they have, and how often they need to synchronize with each other.

Charles Bailey
A: 

As Charles Bailey said, from your description it seems like you are imposing a per-thread time limit.

It could be the case that the timing mechanism you use references wall clock time in debug mode and CPU time (which sums across all processors/cores in use) in release mode. Thus, when THREADS = 2 in release mode, you use the total allotment of CPU time twice as fast, doing half as much work on each core.

Just an idea. Can you give more detail on your timing mechanism?
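
One way to check, if you can get at the worker thread handles, is to compare the 30 seconds of wall-clock time against the CPU time each thread actually consumed. A minimal sketch (GetThreadCpuSeconds is just a helper name I made up):

#include <windows.h>

// CPU time (user + kernel) consumed so far by a thread, in seconds.
double GetThreadCpuSeconds( HANDLE hThread )
{
    FILETIME creationTime, exitTime, kernelTime, userTime;
    if ( !GetThreadTimes( hThread, &creationTime, &exitTime, &kernelTime, &userTime ) )
        return -1.0;   // call failed; check GetLastError()

    ULARGE_INTEGER k, u;
    k.LowPart = kernelTime.dwLowDateTime;   k.HighPart = kernelTime.dwHighDateTime;
    u.LowPart = userTime.dwLowDateTime;     u.HighPart = userTime.dwHighDateTime;

    return ( k.QuadPart + u.QuadPart ) / 1.0e7;   // FILETIME counts 100 ns intervals
}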

Drew Hall
You're right. I wasn't very clear on that. Yes, I have a time limit and each thread uses the same amount of time. This is the call: SetTimer(params->hView, WM_TIMER_EXPIRED, params->nMoveTimeout * 1000, NULL); and I typically set it for 30 seconds. In the 2 thread mode each thread runs its full 30 seconds as I would expect and hope, so I'm quite sure this part is working OK. Each thread is completely independent of the other so there are no synchronization issues. Also, there is very little I/O, just a little output to a window every few seconds.
It's just hard to understand or believe that whatever the problem is, it would: (1) almost exactly double the running time in Release mode, and (2) have no noticeable effect in Debug mode. I do appreciate all your thoughts on this!
Again, badly worded. What I mean to say is that it is surprising to me that in Debug mode each of 2 threads running together gets exactly as much work done as 1 thread running alone (essentially doubling the total computation, as one would hope), but in Release mode each of 2 threads only gets half the work done, so the total computation does not increase at all even though both CPUs are running full tilt.
My earlier comment is a red herring given your timer mechanism (wall clock based, not cpu-cycle count based). But I wonder if in debug mode it's getting serialized, that is, the debugger only lets one thread run at a time? Just a wild guess. Are you running the debug mode program actually in the debugger, or independently?
Drew Hall
Wouldn't that have the opposite effect? If Debug were getting serialized, then running 2 threads in Debug mode for 30 seconds would give the same amount of computation as running 1 thread for 30 seconds. But in Debug I do get double the computation with 2 threads. To answer your question (I hope!): I'm running the debugger from the IDE by hitting the F5 key. Later today I will be near a quad core machine. I'll run some tests there to get more data, but in any case I still think something strange is going on here.
A: 

The fact that you get 30k positions from both 1 and 2 threads looks suspicious to me. Could that limit come from another component in your system? You mention each thread is totally independent, but are you by any chance using any of the Interlocked* functions? They look innocent, but they actually force a synchronization of all CPU caches, which can be painful when trying to squeeze the most out of the CPU.
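
For example, something as innocent-looking as a shared node counter (g_positionsSearched is a made-up name; the point is the pattern to look for) drags its cache line back and forth between the cores on every increment:

#include <windows.h>

volatile LONG g_positionsSearched = 0;   // one counter shared by every search thread

void CountPosition()                     // called once per position searched
{
    // each call acquires the counter's cache line exclusively on the calling core,
    // so two threads hammering it spend much of their time shipping the line around
    InterlockedIncrement( &g_positionsSearched );
}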

What I would do is have each thread do some dummy action (string manipulation or so), just to waste some time. If that scales well, add a portion of the thread's real code to the dummy action and test again. Repeat until the performance stops scaling, which means the latest code addition is the bottleneck.
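
A throwaway workload for that experiment could be as simple as this (purely illustrative, not from your project; note that string formatting can allocate, so heap locking could itself affect scaling):

// Dummy thread body: pure CPU work, no shared data, no I/O.
UINT dummy_search( LPVOID /*pParam*/ )
{
    CString s;
    for ( int i = 0; i < 1000000; i++ )
    {
        s.Format( _T("position %d"), i );
        s.MakeReverse();
    }
    return 0;
}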

Another direction I'd look into is making sure both threads are actually running concurrently, on different CPUs. Try binding each thread to a single CPU. This is not something I'd leave in production, but if your system is loaded by other processes, you might not get the gain you expect from dual CPUs. After all, on a single CPU machine you'll probably get lower throughput using two threads than you'd get using one.
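
A rough way to try that with the thread array from the question (assuming THREADS is no larger than the number of cores) is to set an affinity mask on each worker before resuming it:

for ( i = 0; i < THREADS; i++ )
{
    // pin worker i to core i so two workers can never share a core
    SetThreadAffinityMask( thread[i]->m_hThread, DWORD_PTR(1) << i );
    thread[i]->ResumeThread();
}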

eran
A: 

I ran the program on a quad core system and got consistent but still confusing results. I know I am verging on getting away from a specific programming question and becoming a bit abstract, but I'd really like to hear any comments you might have to help explain the numbers I am seeing. For all of these tests I run for 30 seconds and according to task manager all threads are running full power for the entire 30 seconds.

When running in Debug mode, if I run with 1 thread it gets X amount of work done. If I run 2 threads each thread gets X amount of work done. Similarly with 3 and 4 threads. Scaling is perfect.

When running in Release mode, this is what happens:

With 1 thread: it gets Y amount of work done, where Y is nearly double X.

With 2 threads: Each thread gets Y amount of work done. Again, perfect scaling.

With 3 threads: 1 thread gets Y amount of work done, the other 2 threads get 2/3 Y amount of work done. I've lost about 2/3 of a CPU's worth of work even though one core is presumably completely idle. Task Manager shows 75% CPU usage.

With 4 threads: 1 thread gets Y amount of work done. The other 3 threads get 1/2 Y amount of work done. Now I've lost about 1.5 CPU's worth of computing. The Task Manager shows 100% CPU usage.

The obvious questions are:

(1) Repeating the earlier question, why does Debug mode scale so well, but not Release?

(2) Why is one core always able to get full usage but the others seem to fall off? This lack of symmetry is disturbing.

(3) Why are the others falling off? Memory bandwidth was suggested earlier, but that seems like an awfully steep price.

Any comments or insights are most welcome. And, as always, thanks!

I think that you need to update your original question with this information and probably post code for the worker threads.
Charles Bailey
A: 

There are many things that may hamper your performance.

One problem might be false sharing of cache lines.

When you have something like:

struct data
{
   int cnt_parsed_thread[THREADS];
   // ...
};
static data g_data;

and in the thread function itself:

void threadFunc( int threadNum )
{
   while( !end )
   {
      // ...
      // do something
      ++g_data.cnt_parsed_thread[threadNum];
   }
}

you force the cache line to be shipped from one processor to the other after every increment, stalling computation enormously.

This problem can be worked around by spreading the falsely shared data onto separate cache lines, for example:

struct data
{
   // one counter per thread, spaced CACHELINESIZE ints apart so that
   // each thread's counter lives on its own cache line
   int cnt_parsed_thread[THREADS*CACHELINESIZE];
   // ...

   int& at( int k ) { return cnt_parsed_thread[k*CACHELINESIZE]; }
};

(A cache line is typically 64 bytes, I think; since the array holds ints, CACHELINESIZE here is counted in ints, so something like 16 should be enough for 4-byte ints. Maybe play around with that.)
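
The thread loop then goes through the accessor, something like this (using the g_data instance from the first snippet):

void threadFunc( int threadNum )
{
   while( !end )
   {
      // ...
      // do something
      ++g_data.at( threadNum );   // each thread now touches only its own cache line
   }
}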

Christopher