VS 7.1 release mode does not seem to be properly parallelizing threads while debug mode does. Here is a summary of what is happening.
First, for what it's worth, here is the main piece of code that parallelizes, but I don't think it's an issue:
// parallelize the search
CWinThread* thread[THREADS];
for ( i = 0; i < THREADS; i++ ) {
    // start suspended so m_bAutoDelete can be cleared before the thread runs
    thread[i] = AfxBeginThread( game_search, &parallel_params[i],
                                THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED );
    thread[i]->m_bAutoDelete = FALSE;
    thread[i]->ResumeThread();
}
// wait for every thread to finish, then free the thread objects
for ( i = 0; i < THREADS; i++ ) {
    WaitForSingleObject(thread[i]->m_hThread, INFINITE);
    delete(thread[i]);
}
THREADS is a global variable that I set, and I recompile if I want to change the number of threads. To give a bit of context, this is a game-playing program that searches game positions.
Here is what happens that doesn't make sense to me.
First, in debug mode: if I set THREADS to 1, the single thread manages to search about 13,000 positions. If I set THREADS to 2, each thread searches about 13,000 positions. Great!
If I compile in release mode and set THREADS to 1, the thread manages to search about 30,000 positions, a typical speedup I'm used to seeing when moving from debug to release. But here is the kicker: when I compile with THREADS = 2, each thread only searches about 15,000 positions. That is half of what THREADS = 1 does, so in a release build the second thread gives me no overall speedup whatsoever. :(
Watching task manager when these things run, with THREADS = 1 I see 50% CPU usage on my dual core machine and when THREADS = 2 I see 100% CPU usage. But the release compile seems to be giving me an effective CPU usage of 50%. Or something?!
Any thoughts? Is there something I should be setting in the Property Pages?
Update: The following is also posted below, but it was suggested I update this post. It was also suggested I post code, but it is quite a large project. I'm hoping others have run into this kind of behavior themselves in the past and can shed some light on what is going on.
I ran the program on a quad core system and got consistent but still confusing results. I know I am verging on getting away from a specific programming question and becoming a bit abstract, but I'd really like to hear any comments you might have to help explain the numbers I am seeing. For all of these tests I run for 30 seconds and according to task manager all threads are running full power for the entire 30 seconds.
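Since I can't post the whole project, here is a minimal sketch of the kind of timing test I mean, for anyone who wants to compare numbers. It is a stand-in under stated assumptions, not my actual code: it uses std::thread (C++11, so it needs a newer compiler than VS 7.1) instead of AfxBeginThread, and fake_search_step and count_work_per_thread are hypothetical names, with a dummy CPU-bound loop in place of game_search. Each thread counts in a local variable and writes its total once at the end, so the harness itself shares no data between threads while it runs.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for one unit of search work (NOT the real game_search).
// A xorshift step keeps the loop CPU-bound and touches no shared memory.
static inline std::uint64_t fake_search_step(std::uint64_t state) {
    state ^= state << 13;
    state ^= state >> 7;
    state ^= state << 17;
    return state;
}

// Runs num_threads workers for roughly `duration` and returns the number of
// steps each worker completed.
std::vector<std::uint64_t> count_work_per_thread(unsigned num_threads,
                                                 std::chrono::milliseconds duration) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> counts(num_threads, 0);
    std::atomic<bool> stop(false);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counts, &stop, t] {
            std::uint64_t local = 0;  // per-thread counter, private while running
            std::uint64_t state = 0x9E3779B97F4A7C15ULL + t;  // nonzero seed
            do {
                // work in batches so the stop flag is checked only occasionally
                for (int k = 0; k < 1024; ++k) state = fake_search_step(state);
                local += 1024;
            } while (!stop.load(std::memory_order_relaxed));
            // Fold `state` into the result so the optimizer cannot drop the work.
            // xorshift never reaches 0 from a nonzero seed, so this adds nothing.
            counts[t] = local + (state == 0 ? 1 : 0);
        });
    }

    std::this_thread::sleep_for(duration);
    stop.store(true);
    for (auto& w : workers) w.join();  // join makes each counts[t] visible here
    return counts;
}
```

Calling count_work_per_thread(4, std::chrono::seconds(30)) would mirror a 30-second, 4-thread run. Comparing the per-thread counts across thread counts, and across debug and release builds, is the experiment I describe here.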
When running in Debug mode, if I run with 1 thread it gets X amount of work done. If I run 2 threads each thread gets X amount of work done. Similarly with 3 and 4 threads. Scaling is perfect.
When running in Release mode, this is what happens:
With 1 thread: it gets Y amount of work done, where Y is nearly double X.
With 2 threads: Each thread gets Y amount of work done. Again, perfect scaling.
With 3 threads: 1 thread gets Y amount of work done, the other 2 threads get 2/3 Y amount of work done. I've lost about 2/3 of a CPU even though one is presumably completely idle. Task Manager shows 75% CPU usage.
With 4 threads: 1 thread gets Y amount of work done. The other 3 threads get 1/2 Y amount of work done. Now I've lost about 1.5 CPUs' worth of computing. Task Manager shows 100% CPU usage.
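To make the arithmetic behind those "lost CPU" figures explicit, here is a small sketch. lost_cores is a hypothetical helper, and it assumes throughput is expressed relative to Y, so one unhindered release-mode thread counts as 1.0.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: how many cores' worth of throughput is missing when
// `slowed_threads` of the `threads` workers each achieve only `slowed_fraction`
// of Y and the remaining workers achieve the full Y.
double lost_cores(int threads, int slowed_threads, double slowed_fraction) {
    assert(threads > 0 && slowed_threads >= 0 && slowed_threads <= threads);
    // total work actually done, in units of Y
    double achieved = (threads - slowed_threads) * 1.0
                    + slowed_threads * slowed_fraction;
    // perfect scaling would give `threads` units of Y
    return threads - achieved;
}
```

With the numbers above, lost_cores(3, 2, 2.0 / 3.0) comes out to about 0.667 and lost_cores(4, 3, 0.5) comes out to 1.5.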
The obvious questions are:
(1) Repeating the earlier question, why does Debug mode scale so well, but not Release?
(2) Why is one core always able to get full usage but the others seem to fall off? This lack of symmetry is disturbing.
(3) Why are the others falling off? Memory bandwidth was suggested earlier, but that seems like an awfully steep price.
Any comments or insights are most welcome. And, as always, thanks!