views:

284

answers:

8

I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.

The problem is the following : I have around 1000 loops of each around 10'000 iterations do to. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is to split the 10'000 iteration loops into 4 2'500 iterations loops, ie one per thread. But I have to wait for the 4 small loops to finish before going to the next "big" iteration. This means that I can't bundle the jobs.

My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).

I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().

It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !

My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?

Here is some code to illustrate (some relevant parts copied from my thread pool class) :

// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
    // some housekeeping
    HANDLE signals[2];
    signals[0] = waitSignal;
    signals[1] = endSignal;

    do {
     // wait for one of the signals
     waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);

     // try to get the next job parameters;
     if (tp->getNextJob(threadId, data)) {
      // execute job
      void* output = jobFunc(data.params);

      // tell thread pool that we're done and collect output
      tp->collectOutput(data.ID, output);
     }

     tp->threadDone(threadId);
    }
    while (waitResult - WAIT_OBJECT_0 == 0);

    // if we reach this point, endSignal was sent, so we are done !

    return 0;
}

// create all threads
for (int i = 0; i < nbThreads; ++i) {
 threadData data;
 unsigned int threadId = 0;
 char eventName[20];

 sprintf_s(eventName, 20, "WaitSignal_%d", i);

 data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
        this, CREATE_SUSPENDED, &threadId);
 data.threadId = threadId;
 data.busy = false;
 data.waitSignal = CreateEvent(NULL, true, false, eventName);

 this->threads[threadId] = data;

 // start thread
 ResumeThread(data.handle);
}

// add job
void ThreadPool::addJob(int jobId, void* params) {
    // housekeeping
    EnterCriticalSection(&(this->mutex));

    // first, insert parameters in the list
    this->jobs.push_back(job);

    // then, find the first free thread and wake it
    for (it = this->threads.begin(); it != this->threads.end(); ++it) {
     thread = (threadData) it->second;

     if (!thread.busy) {
      this->threads[thread.threadId].busy = true;

      ++(this->nbActiveThreads);

      // wake thread such that it gets the next params and runs them
      SetEvent(thread.waitSignal);
      break;
     }
    }

    LeaveCriticalSection(&(this->mutex));
}
+1  A: 

The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.

By the way, what is your question exactly ? I will be able to answer more precisely with a more precise question :)

EDIT:

The events part can consume more than your processing in some cases, but should not be that expensive, unless your processing is really fast to achieve. In this case, switching between thredas is expensive too, hence my answer first part on doing things sequencially ...

You should look for inter-threads synchronisation bottlenecks. You can trace threads waiting times to begin with ...

EDIT: After more hints ...

If I guess correctly, your problem is to efficiently use all your computer cores/processors to parralellize some processing essencialy sequential.

Take that your have 4 cores and 10000 loops to compute as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process. You just need to give your four threads thr nth, nth+1, nth+2, nth+3 loops, wait for the four threads to complete then going on. You should use a rendezvous or barrier (a synchronization mechanism that wait for n threads to complete). Boost has such a mechanism. You can look the windows implementation for efficiency. Your thread pool is not really suited to the task. The search for an available thread in a critical section is what is killing your CPU time. Not the event part.

neuro
Mmmh, I think my question is about the cost of using events (are they really expensive or am I doing things wrong ?).
Wookai
Yeah, edit your question it will be better ...
neuro
neuro's approach is probably your best bet. Your other choice is to redesign your loops so that they no longer rely on one another, if you can. You may have to pay a perf penalty, but that's ok: code that's x2 slower but scales linearly with the number of hardware threads wins overall, right?
David Seiler
+1  A: 

It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.

There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.

I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. that way, you can pass a hundred jobs to the thread pool and the sync objects will be used just the once to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.

gbjbaanb
I had the same optimization idea as you, ie letting threads pull jobs without going through WaitForMultipleObjects(), but in my case I have very few jobs per thread, so this won't change much.
Wookai
I thought you had 2500 per thread? Never mind - the alternative is to check out OpenMP which may well be quicker, and definitely easier to implement. (ie. you just put a pragma before the for loop and let it manage everything for you).
gbjbaanb
+3  A: 

Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.

One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.

Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.

David Seiler
The problem is the following : I have around 1000 loops of each around 10'000 iterations do to. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is to split the 10'000 iteration loops into 4 2'500 iterations loops, ie one per thread. But I have to wait for the 4 small loops to finish before going to the next "big" iteration. This means that I can't bundle the jobs.
Wookai
Put that in the question ;)
neuro
That is the real problem ... See my edit in my answer for my 2 cents ...
neuro
+3  A: 

This looks to me as a producer consumer pattern, which can be implented with two semaphores, one guarding the queue overflow, the other the empty queue.

You can find some details here.

Cătălin Pitiș
Are semaphores less expensive than events ?
Wookai
What does it mean "expensive"? In terms of resources? In terms of kernel time spent to lock/unlock?
Cătălin Pitiș
I mean in terms of time.
Wookai
I don't think there is a difference. Anyway, a difference that can be seen. You can always measure with a profiler.
Cătălin Pitiș
Ok, thanks. Thank you for the link, it's interesting. However, I'm not sure that implementing it using the producer/consumer pattern would speed things up.
Wookai
+2  A: 

Watch out, you are still asking for a next job after the endSignal is emitted.

for( ;; ) {
    // wait for one of the signals
    waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
    if( waitResult - WAIT_OBJECT_0 != 0 )
        return;
    //....
}
TimW
Thanks for pointing that out. It's not a problem because the endSignal is called when the job list is empty, so it won't get any job and will finish correctly. But you're totally right !
Wookai
A: 

It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !

"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?

In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?

What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.

Also, as you've had some issues with threadpools here in the past, I'd suggest some debug to count the number of times that your threadfunction is actually invoked... does it match what you expect?

Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.

Roddy
+1  A: 

If you are just parallelizing loops and using vs 2008, I'd suggest looking at OpenMP. If you're using visual studio 2010 beta 1, I'd suggesting looking at the parallel pattern library, particularly the "parallel for" / "parallel for each" apis or the "task group class because these will likely do what you're attempting to do, only with less code.

Regarding your question about performance, here it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small which is why I suggest using an implementation already built. You also need to ensure that you aren't running in debug mode, under a debugger and that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and you aren't hitting false sharing. Each of these has the potential to destroy scalability.

I'd suggest looking at this under a profiler like xperf the f1 profiler in visual studio 2010 beta 1 (it has 2 new concurrency modes which help see contention) or Intel's vtune.

You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."

Good Luck

-Rick

Rick
Thanks for your answer. I'll accept yours as you provide useful links and suggest the use of OpenMP !
Wookai
+1  A: 

Since you say that it is much slower in parallel than sequential execution, I assume that your processing time for your internal 2500 loop iterations is tiny (in the few micro seconds range). Then there is not much you can do except review your algorithm to split larger chunks of precessing; OpenMP won't help and every other synchronization techniques won't help either because they fundamentally all rely on events (spin loops do not qualify).

On the other hand, if your processing time of the 2500 loop iterations is larger than 100 micro seconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it to four processors will not give you more bandwidth, it will actually give you less because of collisions. You could also be running into problems of cache cycling where each of your top 1000 iteration will flush and reload the cache of the 4 cores. Then there is no one solution, and depending on your target hardware, there may be none.

Juice
Thanks for the insights ! OpenMP did help a bit, but it mostly helped by allowing me to get rid of my custom thread pool and rely on something much more reliable.
Wookai
OpenMP probably helped because it uses the current thread for execution. Thus you have 20% less synchronization in your case. Also it is often implemented with a small spin loop before the sleep, so if your execution is fast, it may help to eliminate Events altogether in many cases.
Juice