Hi,

My professor found this interesting experiment on 3D linearly separable kernel convolution using SSE and OpenMP, and gave me the task of benchmarking it on our system. The author claims a remarkable 18-fold speedup over the serial approach! That might not hold everywhere, but we were expecting at least a 2-4x speedup running this on a dual-core Intel.

http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/#comment-41994
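
For reference, one pass of a separable 3D convolution parallelized with OpenMP generally has this shape (my simplified sketch, not the article's actual code; the radius, data layout, and names are placeholders):

    /* One pass (x axis) of a separable 3D convolution; the y and z
       passes are analogous. Compile with -fopenmp. */
    #define R 1  /* hypothetical kernel radius */

    void convolve_x(const float *src, float *dst, const float *k,
                    int nx, int ny, int nz)
    {
        #pragma omp parallel for
        for (int z = 0; z < nz; z++)
            for (int y = 0; y < ny; y++)
                for (int x = R; x < nx - R; x++) {
                    float sum = 0.0f;
                    for (int t = -R; t <= R; t++)
                        sum += k[t + R] * src[(z * ny + y) * nx + x + t];
                    dst[(z * ny + y) * nx + x] = sum;
                }
    }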

Alas, we found no speedup at all. The serial code always performs better, with or without OpenMP.

I am using Linux, and have observed a certain trend: even when no other processes are running on the system, after a while the loadavg starts increasing and the %CPU utilization drops.

Another probable false positive, which I ran into accidentally: I started the program, immediately paused it, then resumed it in the background with bg, and saw a speedup of more than 2x. This happens every time!
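
One measurement detail worth ruling out (an assumption on my part, since the timing code isn't shown above): clock() reports CPU time summed across all threads, so a well-parallelized region can appear to have no speedup by that measure, while omp_get_wtime() gives wall-clock time. A minimal check:

    /* Compare CPU time vs. wall time around the workload under test;
       compile with gcc -std=c99 -O2 -fopenmp timing.c */
    #include <stdio.h>
    #include <time.h>
    #include <omp.h>

    int main(void)
    {
        clock_t c0 = clock();
        double  w0 = omp_get_wtime();

        /* ... workload under test ... */

        double  w1 = omp_get_wtime();
        clock_t c1 = clock();

        printf("CPU time : %.3f s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
        printf("Wall time: %.3f s\n", w1 - w0); /* the number that matters */
        return 0;
    }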

Any advice would be great.

Thanks, Sayan

+2  A: 

You really need to profile your program to identify the bottlenecks. You also need to look at optimisation in a more "holistic" way. Your performance issues may be related to poor design, poor coding, memory bandwidth limitations, and a host of other problems, none of which will be addressed by micro-optimisations such as using SIMD instead of scalar code.

Start with a profile (use a tool like Zoom for this) and work from there.
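
For instance, a quick first pass with gprof on Linux (file names here are just examples):

    gcc -O2 -fopenmp -pg convolution.c -o convolution
    ./convolution              # writes gmon.out in the current directory
    gprof convolution gmon.out > profile.txt

Note that gprof's per-thread accounting is limited, so a sampling profiler like Zoom will give a more faithful picture of the OpenMP code.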

Paul R
A: 

Well, I groped around a bit and then tried the following: I compiled the program with the -O0 option (no optimization) and got a speedup of almost 2 for almost all the XYZ values. I could also see that two threads were being utilized on my dual core (previously it was using only one). But when I remove the OpenMP pragmas, I see no speedup at all, which bothers me, because SSE alone should speed things up considerably. So this speedup can be attributed entirely to OpenMP; I have to find out why SSE is failing. Somebody told me that if the operations are trivial (the weight that word carries is debatable, since it differs from person to person), SSE garners no speedup. But I wrote a small program that calculates sqrt(i)/i for i_max_size = 64000, and the SSE version gave a speedup of 3.5 ~ 4.0. I will post more once I find the root cause.
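
The test was essentially of this shape (a simplified reconstruction, not the exact program I ran):

    /* sqrt(i)/i, scalar vs. SSE; compile with gcc -std=c99 -O2 -msse -lm */
    #include <stdio.h>
    #include <math.h>
    #include <xmmintrin.h>

    #define N 64000

    static float out_scalar[N];
    static float out_sse[N] __attribute__((aligned(16)));

    int main(void)
    {
        /* scalar version (i starts at 1 to avoid division by zero) */
        for (int i = 1; i < N; i++)
            out_scalar[i] = sqrtf((float)i) / (float)i;

        /* SSE version: four elements per iteration */
        for (int i = 4; i < N; i += 4) {
            __m128 v = _mm_setr_ps((float)i, (float)(i + 1),
                                   (float)(i + 2), (float)(i + 3));
            _mm_store_ps(&out_sse[i], _mm_div_ps(_mm_sqrt_ps(v), v));
        }

        printf("%f %f\n", out_scalar[100], out_sse[100]);
        return 0;
    }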

Sayan Ghosh
Well, I sort of found the reason. There is a compiler switch, -mfpmath=unit (where unit is 387 or sse), that decides whether scalar floating-point calculations are compiled to SSE or i387 (x87) instructions. On x86-64 it defaults to sse; on 32-bit x86 the default is 387. http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02119.html
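To compare the two, one can force each unit explicitly (the file name is a placeholder):

    # x87 scalar FP vs. SSE scalar FP
    gcc -O2 -mfpmath=387 test.c -o test_387 -lm
    gcc -O2 -msse2 -mfpmath=sse test.c -o test_sse -lm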
Sayan Ghosh