I'm using the threaded version of FFTW (an FFT library) to try to speed up some code on a dual-CPU machine. Here is the output of time with only 1 thread:

131.838u 1.979s 2:13.91 99.9%

Here it is with 2 threads:

166.261u 30.392s 1:52.67 174.5%

The user times and the CPU load percentages seem to indicate that it is threading pretty effectively, but the wallclock time (which is what I really care about) tells me (I think) that it is taking around 28 extra seconds to deal with the threads. Is that an accurate way to describe the situation? If so, is it fairly normal, or do I probably have something configured incorrectly? Thanks for any light.

+3  A: 

I've used FFTW a fair amount, and have found that, unless you're going to more than two processors, it's almost always a cleaner solution to just use the single-threaded version. It's faster because there's less inter-thread communication, or at least, that's been my experience.

A few things to check out:

  1. Are you configuring your wisdom properly, and using it? Wisdom, once created, will make your transform run much more quickly. If you aren't using it, you should be.
  2. Are you calling the library from one thread, or from two? That was always my problem: locking multiple threads' calls into the library got painful.
  3. How big are your transforms? Are you trying with a small one at first, just to see how it goes, then scaling up?
mmr
Thank you for your help.
1) I'm not using wisdom - just FFTW_MEASURE.
2) I'm not manually creating any threads, if that's what you mean?
3) I've tried several sizes - right now I'm doing 100 4D transforms of around 130k, but I've tried smaller sizes as well.
Here is my function call, if you're interested:
[code]
int jl[4] = {32,50,16,16};
p = fftw_plan_many_dft(4, jl, 100, A, NULL, 1, 32*50*16*16, B, NULL, 1, 32*50*16*16, +1, FFTW_MEASURE);
[/code]
I was hoping it would thread rather naturally - each CPU handling alternate transforms.
Thanks again
Argh, sorry for the dreadful formatting
FFTW_MEASURE takes much longer to plan than FFTW_ESTIMATE, and could easily account for your problems. Try FFTW_ESTIMATE, because FFTW_MEASURE runs and times a large number of permutations to see which is fastest. The number of permutations increases greatly with multiple processors, but if you're saving the wisdom, subsequent transforms should benefit from multiple processors.
mmr
ESTIMATE gives
224.651u 2.169s 1:54.75 197.6%
However, I put a few printfs in the MEASURE code and it doesn't appear as if the planning stage is taking more than a second. I guess maybe the threading overhead is just too high? Thanks again.
This code is pretty complicated, so unless you're wrapping your calls, it could be that your printfs missed something. Threading overhead does tend to be pretty high, which is why I just don't use it. Also, your size there is very small for threading, which is why the FFTW_ESTIMATE result isn't that much different. At this point, I'd see how the threaded and non-threaded performance scales with larger sizes, and then experiment with wisdom saving as you get closer to your target size.
mmr
The size doesn't seem all that small* to me - 32 x 50 x 16 x 16 = 409600, after all (don't know why I had 130k earlier). I'm leaning towards the overhead. Anyway, I hope to have the opportunity to try it on a 16-CPU dual-core machine someday soon, so maybe threading will prove more fruitful in that scenario. Thanks again for all your help.
*I guess that is small if FFTW always computes N-D FFTs in the classical "row-column" fashion, but I don't know if that is the case.
409k points is very small for an FFTW FFT. I would throw 512x512x64 image chunks at it routinely; at that size, adding in the fourth dimension would just overwhelm the memory limit of a 32-bit app on Windows. Yes, it would take time, but even in that scenario, threading didn't add too much unless wisdom was used (and then the first transform took forever).
mmr