views: 388
answers: 7
Hi! I'm trying to parallelize a convolution function in C. Here's the original function which convolves two arrays of 64-bit floats:

void convolve(const Float64 *in1,
              UInt32 in1Len,
              const Float64 *in2,
              UInt32 in2Len,
              Float64 *results)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        for (j = 0; j < in2Len; j++) {
            results[i+j] += in1[i] * in2[j];
        }
    }
}

In order to allow for concurrency (without semaphores), I created a function that computes the result for a particular position in the results array:

void convolveHelper(const Float64 *in1,
                    UInt32 in1Len,
                    const Float64 *in2,
                    UInt32 in2Len,
                    Float64 *result,
                    UInt32 outPosition)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        if (i > outPosition)
            break;
        j = outPosition - i;
        if (j >= in2Len)
            continue;
        *result += in1[i] * in2[j];
    }
}

The problem is that using convolveHelper slows the code down by a factor of about 3.5 (when running on a single thread).

Any ideas on how I can speed up convolveHelper while maintaining thread safety?

+1  A: 
  • Instead of the two if statements in the loop, you can calculate the correct minimum/maximum values for i before the loop.

  • You're calculating each result position separately. Instead, you can split the results array into blocks and have each thread calculate a block. The calculation for a block will look like the convolve function (see the sketch below).
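
A minimal sketch combining both suggestions (a hypothetical convolveBlock worker; the name and the exact bounds handling are illustrative, not from the answer): each thread computes its own slice [outStart, outEnd) of results, with the valid range of i precomputed per output position, so no element of results is written by more than one thread and no locking is needed.

void convolveBlock(const Float64 *in1, UInt32 in1Len,
                   const Float64 *in2, UInt32 in2Len,
                   Float64 *results,
                   UInt32 outStart, UInt32 outEnd)
{
    UInt32 k, i;

    for (k = outStart; k < outEnd; k++) {
        /* valid i satisfy i < in1Len and k - i < in2Len */
        UInt32 start = (k >= in2Len) ? k - in2Len + 1 : 0;
        UInt32 end   = (k + 1 < in1Len) ? k + 1 : in1Len;
        Float64 acc = 0.0;

        for (i = start; i < end; i++)
            acc += in1[i] * in2[k - i];

        results[k] = acc;
    }
}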

interjay
+2  A: 

The most obvious thing that could help would be to pre-compute the starting and ending indices of the loop, and remove the extra tests on i and j (and their associated jumps). This:

for (i = 0; i < in1Len; i++) {
   if (i > outPosition)
     break;
   j = outPosition - i;
   if (j >= in2Len)
     continue;
   *result += in1[i] * in2[j];
}

could be rewritten as:

UInt32 start_i = (in2Len < outPosition) ? outPosition - in2Len + 1 : 0;
UInt32 end_i = (in1Len < outPosition) ? in1Len : outPosition + 1;

for (i = start_i; i < end_i; i++) {
   j = outPosition - i;
   *result += in1[i] * in2[j];
}

This way, the condition j >= in2Len is never true, and the loop test is essentially the combination of the tests i < in1Len and i < outPosition.

In theory you also could get rid of the assignment to j and turn i++ into ++i, but the compiler is probably doing those optimizations for you already.

Tyler McHenry
Exactly what I was looking for! Thanks Tyler :)
splicer
`start_i` needs to be signed, since `outPosition - inLen` will often be negative
splicer
In that case, what should really be done is to test whether `outPosition > inLen` and, if so, set `start_i = 0`, since it doesn't make sense for `i` to be negative.
Tyler McHenry
Hmm.. I just tried your code, and it doesn't work :( It does run quickly, though...
splicer
Sorry, even with your edit it still doesn't work.
splicer
There is one little bug I just found in it. `end_i` should be set to `outPosition + 1` when `outPosition` is the smaller of the two. Without that it is as if your initial test on `i` had been `i >= outPosition` instead of `i > outPosition`. Fixed in answer.
Tyler McHenry
Also, don't be so quick to dismiss answers with "it doesn't work". It's your program, not mine, so I'm not spending a whole lot of time sitting here and mentally debugging it. The approach is valid: precompute the loop bounds. If the algebra I used to come up with my formulas for the loop bounds is off slightly, that's something that you should be able to figure out and fix up.
Tyler McHenry
It looks like `j` will always decrease by 1 each time through the loop, so instead of `j = outPosition - i`, you could start with `j = outPosition` before the loop and then do `j--` each time through. It might be faster depending on your hardware, and it could also reduce the number of registers you need to store your variables.
Noah Lavine
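
A minimal sketch of the decrementing-`j` variant Noah describes, assuming the loop bounds `start_i`/`end_i` have already been precomputed correctly as discussed above (the names come from the earlier snippet):

/* Keep j in step with i instead of recomputing outPosition - i
   on every iteration. */
UInt32 i, j = outPosition - start_i;

for (i = start_i; i < end_i; i++, j--) {
    *result += in1[i] * in2[j];
}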
Fair enough; I'll up-vote you. However, I can't mark your answer as accepted unless your algorithm is sound.
splicer
Crap.. it says "Vote too old to be changed" when I try to vote you up. Sorry dude.
splicer
A: 

Let the convolve helper work on larger sets, calculating multiple results, using a short outer loop.

The key to parallelization is finding a good balance in the distribution of work between threads. Do not use more threads than you have CPU cores.

Split the work evenly between all threads. With this kind of problem, the complexity of each thread's work should be the same.

Ernelli
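
A sketch of what this even split might look like with POSIX threads, reusing the hypothetical convolveBlock worker sketched earlier (the thread count and the use of pthreads are assumptions, not part of the answer):

#include <pthread.h>

#define NUM_THREADS 4            /* keep this at or below the core count */

typedef struct {
    const Float64 *in1;
    UInt32 in1Len;
    const Float64 *in2;
    UInt32 in2Len;
    Float64 *results;
    UInt32 outStart, outEnd;     /* this thread's slice of results */
} BlockJob;

static void *blockThread(void *arg)
{
    BlockJob *job = (BlockJob *) arg;
    convolveBlock(job->in1, job->in1Len, job->in2, job->in2Len,
                  job->results, job->outStart, job->outEnd);
    return NULL;
}

void convolveThreaded(const Float64 *in1, UInt32 in1Len,
                      const Float64 *in2, UInt32 in2Len,
                      Float64 *results)
{
    pthread_t threads[NUM_THREADS];
    BlockJob jobs[NUM_THREADS];
    UInt32 outLen = in1Len + in2Len - 1;
    UInt32 t;

    for (t = 0; t < NUM_THREADS; t++) {
        jobs[t].in1 = in1;  jobs[t].in1Len = in1Len;
        jobs[t].in2 = in2;  jobs[t].in2Len = in2Len;
        jobs[t].results = results;
        /* split [0, outLen) into roughly equal contiguous blocks */
        jobs[t].outStart = outLen * t / NUM_THREADS;
        jobs[t].outEnd   = outLen * (t + 1) / NUM_THREADS;
        pthread_create(&threads[t], NULL, blockThread, &jobs[t]);
    }
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
}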
+10  A: 

Convolutions in the time domain become multiplications in the Fourier domain. I suggest you grab a fast FFT library (like FFTW) and use that. You'll go from O(n^2) to O(n log n).

Algorithmic optimizations nearly always beat micro-optimizations.

Thomas
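
For reference, a rough sketch of the FFTW route (an illustration of Thomas's suggestion, not code from the answer): it uses FFTW 3's real-to-complex interface, treats Float64 as a plain double, and zero-pads both inputs to in1Len + in2Len - 1 samples so that the circular convolution computed via the FFT coincides with the linear convolution produced by convolve().

#include <fftw3.h>
#include <string.h>

void convolveFFT(const double *in1, unsigned in1Len,
                 const double *in2, unsigned in2Len,
                 double *results)            /* in1Len + in2Len - 1 samples */
{
    unsigned n  = in1Len + in2Len - 1;       /* padded length */
    unsigned nc = n / 2 + 1;                 /* length of the r2c output */
    unsigned k;
    double re, im;
    double *a = fftw_alloc_real(n);
    double *b = fftw_alloc_real(n);
    fftw_complex *fa = fftw_alloc_complex(nc);
    fftw_complex *fb = fftw_alloc_complex(nc);
    fftw_plan pa, pb, pc;

    memset(a, 0, n * sizeof(double));        /* zero-pad both inputs */
    memset(b, 0, n * sizeof(double));
    memcpy(a, in1, in1Len * sizeof(double));
    memcpy(b, in2, in2Len * sizeof(double));

    pa = fftw_plan_dft_r2c_1d((int) n, a, fa, FFTW_ESTIMATE);
    pb = fftw_plan_dft_r2c_1d((int) n, b, fb, FFTW_ESTIMATE);
    fftw_execute(pa);
    fftw_execute(pb);

    /* Point-wise complex multiply; FFTW's c2r transform is unnormalized,
       so divide by n here. */
    for (k = 0; k < nc; k++) {
        re = fa[k][0] * fb[k][0] - fa[k][1] * fb[k][1];
        im = fa[k][0] * fb[k][1] + fa[k][1] * fb[k][0];
        fa[k][0] = re / n;
        fa[k][1] = im / n;
    }

    pc = fftw_plan_dft_c2r_1d((int) n, fa, a, FFTW_ESTIMATE);
    fftw_execute(pc);
    memcpy(results, a, n * sizeof(double));

    fftw_destroy_plan(pa);
    fftw_destroy_plan(pb);
    fftw_destroy_plan(pc);
    fftw_free(a);
    fftw_free(b);
    fftw_free(fa);
    fftw_free(fb);
}

The two forward transforms, the point-wise multiply, and the inverse transform together cost O(n log n), compared to O(n^2) for the nested loops.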
I know, but the problem I'm trying to solve specifically requires convolution in the time domain.
splicer
Yes, but Thomas's argument is that converting the input data from the time domain to the frequency domain, doing the convolution as an O(n) multiplication, and then converting back to the time domain is much more efficient than optimizing the O(n^2) convolution algorithm. You don't have to change the way you represent your data; just convert it as part of performing the convolution.
Ernelli
Taking the standard FFT approach is definitely more efficient, but it's only an approximation (albeit a very good one) of discrete time convolution. I need to convolve in the time domain because I'm going to be comparing the results to different results achieved with FFT (using various windowing functions and padding with different amounts of zeroes). Yes, this is for a school project ;)
splicer
But is the running time a big issue then? If you are comparing algorithms/windowing functions, do you have to work on large data sets? In that case, the best "parallelization" would be to run the comparisons on multiple processes/machines.
Ernelli
Yes, running time is an issue. I'm seeing if humans can tell the difference between the results of different audio convolutions. It's annoying having to wait half an hour between tests ;)
splicer
@splicer - I'm not sure how the FFT approach is "only an approximation" of convolution - aren't the two techniques absolutely equal? Unless you mean round-off error?
mtrw
@mtrw: FFT is an implementation of the DFT, not the DTFT. Essentially, the DFT (and FFT) sample the frequency domain. The Wikipedia article on *Spectral Leakage* explains the problem quite well: http://en.wikipedia.org/wiki/Spectral_leakage
splicer
@splicer - In correlation, you multiply the "leaked" spectra together and then transform back. The leakage is a faithful representation of the time domain waveforms in the other domain. The algorithm is zeropad the signals to newLen = 2*max(in1Len, in2Len), calculate the FFTs, multiply the spectra (complex multiply), inverse FFT, take the first in1Len+in2Len-1 samples as the answer. This is how Matlab/Octave implement their correlation functions internally.
mtrw
@mtrw: Can it be shown mathematically that this technique produces *exactly* the same result convolving in the time domain? I thought zero padding merely improved the accuracy of the approximation...
splicer
@Thomas: even though your answer wasn't what I was looking for, I marked it as my accepted answer. If someone sees my question in the future, it's important that they know FFT is the fastest way to perform convolution (even if it's only an approximation).
splicer
@splicer - It can be shown mathematically, but not by me :) Here's a line of reasoning though, see if you find this convincing: First, assume the two waveforms you're convolving are actually sections of two longer waveforms that are periodic pulses, with deadspace between pulses. Second, see that linearly convolving the pulses is the same as one period of circularly convolving their periodic "parents" because of the deadspace. Third, the DFTs of one period of each signal are exact (ripple and leakage and all), so their product and inverse transform are exact too. Hope that helps!
mtrw
A: 

Unless your arrays are very big, using a thread is unlikely to actually help much, since the overhead of starting a thread will be greater than the cost of the loops. However, let's assume that your arrays are large, and threading is a net win. In that case, I'd do the following:

  • Forget your current convolveHelper, which is too complicated and won't help much.
  • Split the interior of the loop into a thread function. I.e. just make

    for (j = 0; j < in2Len; j++) {
        results[i+j] += in1[i] * in2[j];
    }
    

into its own function that takes i as a parameter along with everything else.

  • Have the body of convolve simply launch threads. For maximum efficiency, use a semaphore to make sure that you never create more threads than you have cores.
JSBangs
Yes, my arrays are fairly large (256KB each). The problem with your approach is that the `+=` operation isn't thread-safe: there are many combinations of `i` and `j` which can produce a given `outPosition`.
splicer
A: 
CVS-2600Hertz
This will give completely different results. It does not perform convolution.
interjay
Sure, but you are still optimizing an O(n^2) algorithm when an O(n log n) solution exists. When n grows, O(n log n) always beats O(n^2). Maybe your data is not large enough to reach that crossover, and micro-optimizing the O(n^2) algorithm still improves the result.
Ernelli
Wait a minute... that doesn't work!
splicer
A: 

I finally figured out how to correctly precompute the start/end indexes (a suggestion given by both Tyler McHenry and interjay):

UInt32 i, start, end;

if (in1Len > in2Len) {
    if (outPosition < in2Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in1Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 1 + outPosition - in2Len;
        end = outPosition + 1;
    }
} else {
    if (outPosition < in1Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in2Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 0;
        end = in1Len;
    }
}

for (i = start; i < end; i++) {
    *result += in1[i] * in2[outPosition - i];
}
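
For what it's worth, the same bounds collapse to a clamp, start = max(0, outPosition - in2Len + 1) and end = min(in1Len, outPosition + 1); a sketch that should be equivalent to the branching above:

/* Equivalent, more compact form of the if/else ladder above. */
start = (outPosition >= in2Len) ? outPosition - in2Len + 1 : 0;
end   = (outPosition + 1 < in1Len) ? outPosition + 1 : in1Len;

for (i = start; i < end; i++) {
    *result += in1[i] * in2[outPosition - i];
}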

Unfortunately, precomputing the indexes produces no noticeable decrease in execution time :(

splicer