Hi, I have a really strange problem.

I have an application that launches some workers in parallel:

int i = 0;                        // index into workers[] / threads[]
for (it = jobList.begin(); it != jobList.end(); ++it) {
    DWORD threadId;
    Job job = *it;
    Worker *worker = new Worker(job);
    workers[i] = worker;
    threads[i++] = CreateThread(NULL, 0, &launchThread, worker, 0, &threadId);
}
WaitForMultipleObjects((DWORD)jobList.size(), threads, TRUE, INFINITE);

The workers allocate a bunch of things, so I assume they synchronize on new, but that is the only place where they could synchronize with each other.

When I run the application on a single-core machine, everything is fine; when I run it on a multi-core machine, performance gets much worse, even worse than this fully serialized version:

int i = 0;
for (it = jobList.begin(); it != jobList.end(); ++it) {
    DWORD threadId;
    Job job = *it;
    Worker *worker = new Worker(job);
    workers[i] = worker;
    threads[i++] = CreateThread(NULL, 0, &launchThread, worker, 0, &threadId);
    WaitForSingleObject(threads[i-1], INFINITE);
}

Does anyone have a reasonable guess?

EDIT:

I have run some tests, and I've found that:

  1. Replacing the allocator with a state-of-the-art parallel allocator doesn't help.
  2. The multithreaded application performs better on a machine with a Core 2 Duo (two cores sharing an L2 cache) than on a dual Xeon (two processors with separate caches).

I suspect I have an application with a memory-access bottleneck in my hands, but how can I check whether this is really the problem, or whether I should be looking elsewhere?
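One cheap way to test the memory-access hypothesis is a false-sharing microbenchmark: two threads hammering counters that either share a cache line or sit on separate ones. This is only an illustrative sketch using standard C++ threads (not the question's Win32 code; `Counters`, `bump`, and `runPair` are made-up names); if removing the padding makes the multi-core run dramatically slower, cache-line ping-pong between cores is a plausible culprit.

```cpp
#include <cstdint>
#include <thread>

// Two counters in one struct: with the padding they live on separate
// cache lines, without it they share one and the cores fight over it.
struct Counters {
    uint64_t a;
    char pad[64];   // delete this field to force a and b onto one line
    uint64_t b;
};

// volatile keeps the compiler from collapsing the loop into one add,
// so the memory traffic we want to measure actually happens
static void bump(volatile uint64_t *c, int iters) {
    for (int i = 0; i < iters; ++i)
        *c = *c + 1;
}

// Run both counters from two threads; time this call with and
// without the padding on your single-core and multi-core machines.
uint64_t runPair(int iters) {
    Counters c = {};                       // zero both counters
    std::thread t1(bump, &c.a, iters);
    std::thread t2(bump, &c.b, iters);
    t1.join();
    t2.join();
    return c.a + c.b;
}
```

Each thread touches only its own counter, so the result is deterministic; only the elapsed time changes with the layout.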

A: 

How many threads are you creating, and how many cores do you have? Anything more than 2 threads per core will DEGRADE performance significantly.

Usually when approaching a problem such as this you set up a thread pool (i.e. at most 4 threads for a dual-core machine) and then you assign each thread a "job". When that job completes, the thread pops the next job off the job list and carries on processing the new one. You do this until every job has been processed. Bear in mind a thread in the pool can DIRECTLY grab the next job off the job list rather than waiting for the main thread to assign it one.

The workflow is as follows.

1) Main thread creates a number of worker threads.
2) Main thread fills the job list.
3) Worker thread checks whether there are any more jobs and grabs the next from the list.
4) Go to 3 if there are jobs left.
5) Complete (or return to 2, depending on your requirements).

There are other approaches to the problem but the one above is probably the most efficient and easy to implement.
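The workflow above can be sketched roughly like this, using standard C++ threads and a mutex as a portable stand-in for the Win32 primitives in the question (the `JobPool` class and its names are illustrative, not an existing API):

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Minimal job-pool sketch: workers pull jobs themselves (step 3)
// until the list is exhausted (step 4), instead of one thread per job.
class JobPool {
public:
    explicit JobPool(std::vector<std::function<void()>> jobs)
        : jobs_(std::move(jobs)), next_(0) {}

    // Step 1: spawn a fixed number of workers (e.g. one per core),
    // then wait for all of them to drain the job list.
    void run(unsigned threadCount) {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < threadCount; ++i)
            workers.emplace_back([this] { workerLoop(); });
        for (std::thread &t : workers)
            t.join();
    }

private:
    void workerLoop() {
        for (;;) {
            std::size_t idx;
            {
                std::lock_guard<std::mutex> lock(m_);
                if (next_ >= jobs_.size())
                    return;          // step 5: no jobs left, worker exits
                idx = next_++;       // step 3: grab the next job index
            }
            jobs_[idx]();            // run the job outside the lock
        }
    }

    std::vector<std::function<void()>> jobs_;
    std::size_t next_;
    std::mutex m_;
};
```

The key design point is that only the tiny index bump is serialized; the jobs themselves run with no lock held.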

Goz
This doesn't explain why I get the degradation only when I run it on a machine with more than one processor/core.
akappa
Nonetheless, using a thread-pooling system as outlined above is incredibly likely to fix your problems. You didn't answer my question at the start, btw...
Goz
I have an ancient compiler that doesn't let me use most of the thread-pool and parallelization techniques that I've found. BTW, on a single-core machine the thread/core ratio is higher, and yet I don't get the 2x slowdown that I see on a dual-processor machine, so the reason should lie elsewhere.
akappa
It probably does lie elsewhere, that I won't deny. It is very strange though. I wonder if the problem lies directly with your ancient compiler making optimisations that assume two things cannot happen at the same time? What compiler are you using?
Goz
Also, have you tried disabling optimisations in your compiler and seeing what happens? It won't necessarily indicate it's not a problem with the compiler, but if it does fix things then you definitely know it's an optimisation problem...
Goz
+1  A: 

Do your individual Jobs call new as well? new is nearly always thread-safe, but this safety usually comes at a huge performance penalty (yes, that paper talks about malloc, but (1) the same issues plague new, and (2) new is often implemented on top of malloc).
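To see why per-iteration new can dominate, compare the same workload written two ways; this is an illustrative sketch (the functions `allocHeavy`/`allocLight` are made up for the example, not code from the question). The first variant takes the allocator's lock on every iteration of every thread; the second allocates once per thread and reuses the buffer.

```cpp
#include <cstddef>
#include <vector>

// Variant 1: new/delete on every iteration. When several threads run
// this loop at once, each pass contends for the heap's internal lock.
std::size_t allocHeavy(int iters, std::size_t sz) {
    std::size_t total = 0;
    for (int i = 0; i < iters; ++i) {
        char *p = new char[sz];   // heap lock taken here, every time
        p[0] = 1;
        total += p[0];
        delete[] p;               // and again here
    }
    return total;
}

// Variant 2: one allocation up front, then lock-free reuse.
std::size_t allocLight(int iters, std::size_t sz) {
    std::vector<char> buf(sz);    // single allocation per thread
    std::size_t total = 0;
    for (int i = 0; i < iters; ++i) {
        buf[0] = 1;
        total += buf[0];
    }
    return total;
}
```

Both compute the same result; only the amount of allocator traffic differs, which is exactly the kind of hidden serialization that hurts on multiple cores while being invisible on one.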

You have a potential memory leak: if the call to new Worker(job) fails, there is no cleanup code. Of course, you may have removed that code in order to post the example, or this may be the whole program and you're relying on the operating system to clean up. You might consider a Scope Guard-like solution for that.

Overall, I would recommend looking at something like Intel's Threading Building Blocks or Windows's thread pool. Those should handle a lot of the tricky details for you (how many threads to create, how to schedule fairly, what data to give each thread to avoid cache misses, etc.).

Max Lybbert
Unfortunately I use Metrowerks, so I cannot use Intel TBB because it isn't supported. Thank you for the suggestion of Windows's thread pool; I was unaware of it (I've tried some thread pool libraries, but they were incompatible with MW or implemented in a ridiculous way).
akappa
+1  A: 

The Interlocked* functions can sometimes fly under the radar when you're looking for synchronization. They are quite fast, but they do force some synchronization and CPU cache updates, which will slow you down. That said, one would probably have to use those functions a lot to get the impact you're describing.

Having no further details, I would suggest profiling the worker threads, drilling down into the lengthy parts of the code and eventually identifying the bottleneck. Given the effect is significant, the bottleneck should be noticeable. If you find the place but not the reason, update your question with the actual part of the code causing the problem.
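If no real profiler is available for the ancient toolchain, even a hand-rolled scoped timer around suspect regions can narrow things down; here is a minimal sketch with standard C++ (`ScopedTimer` is an illustrative name, not an existing API), to be compared between single-core and multi-core runs:

```cpp
#include <chrono>
#include <cstdio>

// Poor man's profiler: prints how long a scope took when it ends.
// Wrap candidate hot regions, run on one core and on several, and
// see which region's time blows up on the multi-core machine.
class ScopedTimer {
public:
    explicit ScopedTimer(const char *label)
        : label_(label), start_(std::chrono::steady_clock::now()) {}

    // Microseconds elapsed since construction.
    long long elapsedUs() const {
        return std::chrono::duration_cast<std::chrono::microseconds>(
                   std::chrono::steady_clock::now() - start_).count();
    }

    ~ScopedTimer() {
        std::fprintf(stderr, "%s: %lld us\n", label_, elapsedUs());
    }

private:
    const char *label_;
    std::chrono::steady_clock::time_point start_;
};
```

Usage is just `{ ScopedTimer t("allocation phase"); /* suspect code */ }`; a real profiler gives far better data, but this needs nothing beyond the standard library.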

eran
+1  A: 

You may try putting all the dynamic memory allocation before creating any worker threads. Dynamic memory allocation accesses the heap, which requires critical-section access, and that is exactly what every new is doing. With multiple worker threads running and doing dynamic allocation themselves, your main thread may get very little opportunity to allocate memory, because it has to wait for the other threads to come out of new.
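The suggestion can be sketched like this, again with standard C++ threads as a stand-in for CreateThread; the `Job`/`Worker` shapes and `runAll` are illustrative guesses at the question's types, with all allocation moved into the main thread so the workers never touch the heap:

```cpp
#include <thread>
#include <vector>

struct Job { int input; };          // stand-in for the question's Job

struct Worker {                     // stand-in for the question's Worker
    Job job;
    std::vector<double> scratch;    // work area allocated up front
    explicit Worker(Job j) : job(j), scratch(1024) {}

    // Pure computation: no new/delete here, so no heap-lock contention.
    void run() { scratch[0] = job.input * 2.0; }
};

double runAll(const std::vector<Job> &jobs) {
    // Phase 1, main thread only: every allocation happens here,
    // before a single worker thread exists.
    std::vector<Worker> workers;
    workers.reserve(jobs.size());
    for (const Job &j : jobs)
        workers.emplace_back(j);

    // Phase 2: spawn the threads; they only compute on memory
    // that already belongs to them.
    std::vector<std::thread> threads;
    for (Worker &w : workers)
        threads.emplace_back([&w] { w.run(); });
    for (std::thread &t : threads)
        t.join();

    double sum = 0;
    for (const Worker &w : workers)
        sum += w.scratch[0];
    return sum;
}
```

Each thread writes only to its own preallocated `Worker`, so there is no allocator traffic and no sharing between threads during the parallel phase.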