I've just started programming with POSIX threads on a dual-core x86_64 Linux system. It seems that 256 threads is about the optimum for performance with the way I've done it. I'm wondering how this could be, and whether it means my approach is wrong and a better approach would require far fewer threads while being just as fast or faster?

For further background (the program in question is a skeleton for a multi-threaded M-set image generator) see the following questions I've asked already:

Using threads, how should I deal with something which ideally should happen in sequential order?

How can my threaded image generating app get it’s data to the gui?

Perhaps I should mention that the skeleton (in which I've reproduced minimal functionality for testing and comparison) is now displaying the image, and the actual calculations are done almost twice as fast as the non-threaded program.

So if 256 threads running faster than 8 threads is not indicative of a poor approach to threading, why do 256 threads outperform 8 threads?

The speed test case is a portion of the Mandelbrot Set located at:

xmin -0.76243636067708333333333328
xmax -0.7624335575810185185185186
ymax 0.077996663411458333333333929

calculated to a maximum of 30000 iterations.

On the non-threaded version, rendering time on my system is around 15 seconds. On the threaded version, the average time for 8 threads is 7.8 seconds, while for 256 threads it is 7.6 seconds.
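
For reference, the per-pixel calculation is the standard escape-time iteration, roughly along these lines (a simplified sketch, not my actual code; the names are illustrative):

    /* Simplified escape-time loop for one pixel; max_iter is 30000 in the
     * test case above. The iteration count returned determines the colour. */
    static int mandel_iters(double cr, double ci, int max_iter)
    {
        double zr = 0.0, zi = 0.0;
        int i;
        for (i = 0; i < max_iter && zr * zr + zi * zi <= 4.0; i++) {
            double t = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = t;
        }
        return i;
    }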

+2  A: 

Could it be your app is I/O bound? How is the image data generated?

Evert
I don't know if it is I/O bound. How would I detect that?
James Morris
If the source of your image data is read from your hard disk, network, or some other source, your threads will be waiting in parallel to read from that source. If the data used to generate the image is 100% computed, or read into memory beforehand, there is a good chance it is CPU bound.
Evert
Mandelbrot sets are entirely calculation-generated.
caf
+1  A: 

A performance improvement gained by allocating more threads than cores suggests that the CPU is not the bottleneck. If I/O such as disk, memory, or even network access is involved, your results make perfect sense.

Spencer Ruport
+1  A: 

You are probably benefiting from Simultaneous Multithreading (SMT). Your operating system schedules more threads than cores available, and will swap out threads that are stalled waiting for resources (such as a memory load) in favor of ones that can run. This can very effectively hide the latencies of your memory system from your program, and it is the technique used to great effect for massive parallelization in CUDA for general-purpose GPU programming.

Aron Ahmadia
Simultaneous Multithreading is the same as hyperthreading used in newer Intel CPUs. The OP stated he's using a dual-core system, and to my knowledge, there is no modern Intel dual-core hyperthreading CPU.
Jay Conrod
Also, with regard to CUDA, it's possible you're thinking of Single Instruction Multiple Threads (SIMT), which has a similar acronym. I haven't heard of any GPUs using SMT.
Jay Conrod
+4  A: 

Well, probably yes, you're doing something wrong.

However, there are circumstances where 256 threads would run better than 8 without you necessarily having a bad threading model. One must remember that having 8 threads does not mean all 8 threads are actually running all the time. Anytime one thread makes a blocking syscall to the operating system, the thread will stop running and wait for the result. In the meantime, another thread can often do work.

There's this myth that one can't usefully use more threads than contexts on the CPU, but that's just not true. If your threads block on a syscall, it can be critical to have another thread available to do more work. (In practice when threads block there tends to be less work to do, but this is not always the case.)

It's all very dependent on the workload, and there's no one right number of threads for any particular application. Generally you never want fewer threads available than the OS will run, and that's the only true rule. (Unfortunately this can be very hard to find out, so people tend to just fire up as many threads as there are contexts and then use non-blocking syscalls where possible.)
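
As a toy illustration of why oversubscription can help when threads block (a rough sketch assuming Linux and pthreads; the 5 ms sleep stands in for a blocking syscall and the constants are arbitrary):

    /* A fixed total amount of work is split across N threads. Each unit of
     * work is a blocking call (nanosleep here) followed by a compute burst.
     * With more threads than cores, the blocking periods overlap, so wall
     * time drops even though the total CPU work is unchanged.
     * Build: cc demo.c -pthread   Compare: time ./a.out 8  vs  time ./a.out 256 */
    #include <pthread.h>
    #include <stdlib.h>
    #include <time.h>

    #define TOTAL_UNITS 256

    static void *worker(void *arg)
    {
        int units = *(int *)arg;
        struct timespec block = { 0, 5 * 1000 * 1000 };   /* 5 ms "syscall" */
        volatile double x = 0.0;

        for (int u = 0; u < units; u++) {
            nanosleep(&block, NULL);            /* blocked: the CPU is free for others */
            for (int i = 0; i < 2000000; i++)   /* then a burst of computation */
                x += i * 1e-9;
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = argc > 1 ? atoi(argv[1]) : 8;
        int units = TOTAL_UNITS / nthreads;     /* same total work regardless of N */
        pthread_t *tid = malloc(nthreads * sizeof *tid);

        for (int i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, worker, &units);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);

        free(tid);
        return 0;
    }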

D.J. Capelis
+1  A: 

If you are seeing a performance increase with the jump to 256 threads, then what you are probably dealing with is a resource bottleneck. At some point, your code is waiting for some slow device (a hard disk or a network connection, for example) in order to continue. With multiple threads, waiting on this slow device isn't a problem because instead of sitting idle and twiddling its electronic thumbs, the CPU can process another thread while the first thread is waiting on the slow device. The more parallel threads that are running, the more work the CPU can do while it is waiting on something else.

If you are seeing performance improve all the way up to 256 threads, I am tempted to say that you have a major performance bottleneck somewhere and it's not the CPU. To test this, try to see if you can measure the idle time of individual threads. I suspect that you will see your threads are stuck in a "blocked" or "waiting" state for a longer portion of their lifetime than they spend in the "running" or "active" state. Some debuggers or function profiling tools will let you do this, and I think there are also Linux tools to do this on the command line.
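
For example, one rough way to check from inside the program (assuming Linux, where CLOCK_THREAD_CPUTIME_ID reports per-thread CPU time) is to compare each worker thread's CPU time with its wall-clock lifetime; if the CPU time is much smaller, the thread spent most of its life blocked:

    /* Sketch: call this at the end of a worker thread's function. A large gap
     * between wall time and CPU time means the thread was mostly waiting.
     * wall_start should be taken with CLOCK_MONOTONIC when the thread starts.
     * Link with -pthread (and -lrt on older glibc). */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static double ts_seconds(const struct timespec *ts)
    {
        return ts->tv_sec + ts->tv_nsec / 1e9;
    }

    static void report_thread_usage(const struct timespec *wall_start)
    {
        struct timespec wall_now, cpu_used;

        clock_gettime(CLOCK_MONOTONIC, &wall_now);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_used);

        fprintf(stderr, "thread %lu: wall %.3f s, cpu %.3f s\n",
                (unsigned long)pthread_self(),
                ts_seconds(&wall_now) - ts_seconds(wall_start),
                ts_seconds(&cpu_used));
    }

From the command line, top -H will also show per-thread CPU usage and state.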

bta