views:

368

answers:

2

Hello all,

I wrote a C program which reads a dataset from a file and then applies a data mining algorithm to find the clusters and classes in the data. At the moment I am trying to rewrite this sequential program multithreaded with PThreads and I am newbie to a parallel programming and I have a question about the number of worker threads which struggled my mind:

What is the best practice to find the number of worker threads when you do parallel programming and how do you determine it? Do you try different number of threads and see its results then determine or is there a procedure to find out the optimum number of threads. Of course I'm investigating this question from the performance point of view.

+2  A: 

You basically want to have as many ready-to-run threads as you have cores available, or at most 1 or 2 more to ensure no core that's available to you will ever be left idle. The trick is in estimating how many threads will typically be blocked waiting for something else (mostly I/O), as that is totally dependent on your application and even on external entities beyond your control (databases, other distributed services, etc, etc).

In the end, once you've determined about how many threads should be optimal, running benchmarks for thread pool sizes around your estimated value, as you suggest, is good practice (at the very least, it lets you double check your assumptions), especially if, as it appears, you do need to get the last drop of performance out of your system!

Alex Martelli
Thanx Alex , but the estimated number of threads will be machine dependent in that case isn't it. I'm trying to find a portable way. But actually what i am trying to find is some kind of formulation(if there exists).Also after reading your comment I've found the following paper but haven't read it yet:http://portal.acm.org/citation.cfm?id=346152.346320
systemsfault
How can the optimal number of threads be the same on, say, a machine with 2 usable cores, and one with 8? _Of course_ it will depend on the machine; any "formulation" that make things appear otherwise is just going to be **wrong**!-). The article you quote takes into account system performance characteristics, and estimates the workloads based on analyzing server logs (only any use for web or other network based services, but then that's what the article addresses).
Alex Martelli
+1  A: 

There are a couple of issues here.

  1. As Alex says, the number of threads you can use is application-specific. But there are also constraints that come from the type of problem you are trying to solve. Do your threads need to communicate with one another, or can they all work in isolation on individual parts of the problem? If they need to exchange data, then there will be a maximum number of threads beyond which inter-thread communication will dominate, and you will see no further speed-up (in fact, the code will get slower!). If they don't need to exchange data then threads equal to the number of processors will probably be close to optimal.

  2. Dynamically adjusting the thread pool to the underlying architecture for speed at runtime is not an easy task! You would need a whole lot of additional code to do runtime profiling of your functions. See for example the way FFTW works in parallel. This is certainly possible, but is pretty advanced, and will be hard if you are new to parallel programming. If instead the number of cores estimate is sufficient, then trying to determine this number from the OS at runtime and spawning your threads accordingly will be a much easier job.

To answer your question about technique: Most big parallel codes run on supercomputers with a known architecture and take a long time to run. The best number of processors is not just a function of number, but also of the communication topology (how the processors are linked). They therefore benefit from a testing phase where the best number of processors is determined by measuring the time taken on small problems. This is normally done by hand. If possible, profiling should always be preferred to guessing based on theoretical considerations.

ire_and_curses
I wasn't planning to run my code on HPC or Grid. So in my case I'm only planning to use OpenMP and PThreads. So my considerations are SMPs only. I see that it is not an easy task and the current approach that I'm following is doing by estimation and enabling user to change it from configuration file if needed. But this didn't satisfied me, so i wondered if there is an existing better technique.
systemsfault