views:

553

answers:

3

I have to apply a convolution filter on each row of many images. The classic is 360 images of 1024x1024 pixels. In my use case it is 720 images 560x600 pixels.

The problem is that my code is much slower than what is advertised in articles.

I have implemented the naive convolution, and it takes 2m 30s. I then switched to FFT using fftw. I used complex 2 complex, filtering two rows in each transform. I'm now around 20s.

The thing is that articles advertise around 10s and even less for the classic condition. So I'd like to ask the experts here if there could be a faster way to compute the convolution.

Numerical recipes suggest to avoid the sorting done in the dft and adapt the frequency domain filter function accordingly. But there is no code example how this could be done.

Maybe I lose time in copying data. With real 2 real transform I wouldn't have to copy the data into the complexe values. But I have to pad with 0 anyway.

EDIT: see my own answer below for progress feedback and further information on solving this issue.

Question (precise reformulation):

I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete non periodic function (512 to 2048 values). Apparently the discrete time Fourier transform is the way to go. Though, I'd like to avoid data copy and conversion to complex, and avoid the butterfly reordering.

A: 

You may want to add image processing as a tag.

But, this article may be of interest, esp with the assumption the image is a power or 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.

http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php

If you want to go faster you may want to use the GPU to actually do the work.

This book may be helpful for you, if you go with the GPU: http://www.springerlink.com/content/kd6qm361pq8mmlx2/

James Black
Very interesting reading. I should have added that the FFT computation is performed in parallel to other processing in the GPU card. It is common practice in this domain to perform the FFT on the CPU because it is said to be much faster than the processing performed on the GPU. Unfortunately, in my current situation, the FFT filtering is slower. The GPU processing takes ~15s and the FFT filtering ~20s.
chmike
A: 

This answer is to collect progress report feedback on this issue.

Edit 11 oct.:

The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy in system time up to 42% for 10s. When I wait until the CPU is back to 0%, before restarting my program I then get the 15.35s execution time which comes from the GPU processing. I get the same time if I comment out the FFT filtering.

So the FFT is in fact currently faster then the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block where I copy the processing result before writing it to disk. For the input data I use a memory map.

I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still actuality because there is room to optimize the GPU processing like for instance by pipelining the transfer of data to process.

chmike
+2  A: 

FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.

The key for you to get maximum performance (outside of hardware ... the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW use the 'patient' setting when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to be using the Real version of the forward 1D FFT and not the Complex version; and only use single (floating point) precision if you can.

If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. The have hand tuned FFT's for Intel processors that have been optimized for images with various bit depths.

Paul
CenterSpace Software

Paul