views: 388
answers: 7
Hi! I'm trying to parallelize a convolution function in C. Here's the original function which convolves two arrays of 64-bit floats:

void convolve(const Float64 *in1,
              UInt32 in1Len,
              const Float64 *in2,
              UInt32 in2Len,
              Float64 *results)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        for (j = 0; j < in2Len; j++) {
            results[i+j] += in1[i] * in2[j];
        }
    }
}

In order to allow for concurrency (without semaphores), I created a function that computes the result for a particular position in the results array:

void convolveHelper(const Float64 *in1,
                    UInt32 in1Len,
                    const Float64 *in2,
                    UInt32 in2Len,
                    Float64 *result,
                    UInt32 outPosition)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        if (i > outPosition)
            break;
        j = outPosition - i;
        if (j >= in2Len)
            continue;
        *result += in1[i] * in2[j];
    }
}

The problem is that using convolveHelper slows the code down by a factor of about 3.5 (when running on a single thread).

Any ideas on how I can speed up convolveHelper while maintaining thread safety?

+1  A: 
  • Instead of the two if statements in the loop, you can calculate the correct minimum/maximum values for i before the loop.

  • You're calculating each result position separately. Instead, you can split the results array into blocks and have each thread calculate a block. The calculation for a block will look like the convolve function (see the sketch below).
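
A minimal sketch combining both suggestions (a hypothetical convolveBlock worker; the name and the exact bounds handling are illustrative, not from the answer): each thread computes its own slice [outStart, outEnd) of results, with the valid range of i precomputed per output position, so no element of results is written by more than one thread and no locking is needed.

void convolveBlock(const Float64 *in1, UInt32 in1Len,
                   const Float64 *in2, UInt32 in2Len,
                   Float64 *results,
                   UInt32 outStart, UInt32 outEnd)
{
    UInt32 k, i;

    for (k = outStart; k < outEnd; k++) {
        /* valid i satisfy i < in1Len and k - i < in2Len */
        UInt32 start = (k >= in2Len) ? k - in2Len + 1 : 0;
        UInt32 end   = (k + 1 < in1Len) ? k + 1 : in1Len;
        Float64 acc = 0.0;

        for (i = start; i < end; i++)
            acc += in1[i] * in2[k - i];

        results[k] = acc;
    }
}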

interjay
+2  A: 

The most obvious thing that could help would be to pre-compute the starting and ending indices of the loop, and remove the extra tests on i and j (and their associated jumps). This:

for (i = 0; i < in1Len; i++) {
   if (i > outPosition)
     break;
   j = outPosition - i;
   if (j >= in2Len)
     continue;
   *result += in1[i] * in2[j];
}

could be rewritten as:

UInt32 start_i = (in2Len < outPosition) ? outPosition - in2Len + 1 : 0;
UInt32 end_i = (in1Len < outPosition) ? in1Len : outPosition + 1;

for (i = start_i; i < end_i; i++) {
   j = outPosition - i;
   *result += in1[i] * in2[j];
}

This way, the condition j >= in2Len is never true, and the loop test is essentially the combination of the tests i < in1Len and i < outPosition.

In theory you also could get rid of the assignment to j and turn i++ into ++i, but the compiler is probably doing those optimizations for you already.

Tyler McHenry
Exactly what I was looking for! Thanks Tyler :)
splicer
`start_i` needs to be signed, since `outPosition - inLen` will often be negative
splicer
In that case, what should really be done is to test whether `outPosition > inLen` and, if so, set `start_i = 0`, since it doesn't make sense for `i` to be negative.
Tyler McHenry
Hmm.. I just tried your code, and it doesn't work :( It does run quickly, though...
splicer
Sorry, even with your edit it still doesn't work.
splicer
There is one little bug I just found in it. `end_i` should be set to `outPosition + 1` when `outPosition` is the smaller of the two. Without that it is as if your initial test on `i` had been `i >= outPosition` instead of `i > outPosition`. Fixed in answer.
Tyler McHenry
Also, don't be so quick to dismiss answers with "it doesn't work". It's your program, not mine, so I'm not spending a whole lot of time sitting here and mentally debugging it. The approach is valid: precompute the loop bounds. If the algebra I used to come up with my formulas for the loop bounds is off slightly, that's something that you should be able to figure out and fix up.
Tyler McHenry
It looks like `j` will always decrease by 1 each time through the loop, so instead of `j = outPosition - i`, you could start with `j = outPosition` before the loop and then do `j--` each time through. It might be faster depending on your hardware, and it could also reduce the number of registers you need to store your variables.
Noah Lavine
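
A minimal sketch of the decrementing-`j` variant Noah describes, assuming the loop bounds `start_i`/`end_i` have already been precomputed correctly as discussed above (the names come from the earlier snippet):

/* Keep j in step with i instead of recomputing outPosition - i
   on every iteration. */
UInt32 i, j = outPosition - start_i;

for (i = start_i; i < end_i; i++, j--) {
    *result += in1[i] * in2[j];
}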
Fair enough; I'll up-vote you. However, I can't mark your answer as accepted unless your algorithm is sound.
splicer
Crap.. it says "Vote too old to be changed" when I try to vote you up. Sorry dude.
splicer
A: 

Let the convolve helper work on larger sets, calculating multiple results, using a short outer loop.

The key to parallelization is finding a good balance in the distribution of work between threads. Do not use more threads than you have CPU cores.

Split the work evenly between all threads. With this kind of problem, the complexity of each thread's work should be the same.

Ernelli
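
A sketch of what this even split might look like with POSIX threads, reusing the hypothetical convolveBlock worker sketched earlier (the thread count and the use of pthreads are assumptions, not part of the answer):

#include <pthread.h>

#define NUM_THREADS 4            /* keep this at or below the core count */

typedef struct {
    const Float64 *in1;
    UInt32 in1Len;
    const Float64 *in2;
    UInt32 in2Len;
    Float64 *results;
    UInt32 outStart, outEnd;     /* this thread's slice of results */
} BlockJob;

static void *blockThread(void *arg)
{
    BlockJob *job = (BlockJob *) arg;
    convolveBlock(job->in1, job->in1Len, job->in2, job->in2Len,
                  job->results, job->outStart, job->outEnd);
    return NULL;
}

void convolveThreaded(const Float64 *in1, UInt32 in1Len,
                      const Float64 *in2, UInt32 in2Len,
                      Float64 *results)
{
    pthread_t threads[NUM_THREADS];
    BlockJob jobs[NUM_THREADS];
    UInt32 outLen = in1Len + in2Len - 1;
    UInt32 t;

    for (t = 0; t < NUM_THREADS; t++) {
        jobs[t].in1 = in1;  jobs[t].in1Len = in1Len;
        jobs[t].in2 = in2;  jobs[t].in2Len = in2Len;
        jobs[t].results = results;
        /* split [0, outLen) into roughly equal contiguous blocks */
        jobs[t].outStart = outLen * t / NUM_THREADS;
        jobs[t].outEnd   = outLen * (t + 1) / NUM_THREADS;
        pthread_create(&threads[t], NULL, blockThread, &jobs[t]);
    }
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
}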
+10  A: 

Convolutions in the time domain become multiplications in the Fourier domain. I suggest you grab a fast FFT library (like FFTW) and use that. You'll go from O(n^2) to O(n log n).

Algorithmic optimizations nearly always beat micro-optimizations.

Thomas
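
For reference, a rough sketch of the FFTW route (an illustration of Thomas's suggestion, not code from the answer): it uses FFTW 3's real-to-complex interface, treats Float64 as a plain double, and zero-pads both inputs to in1Len + in2Len - 1 samples so that the circular convolution computed via the FFT coincides with the linear convolution produced by convolve().

#include <fftw3.h>
#include <string.h>

void convolveFFT(const double *in1, unsigned in1Len,
                 const double *in2, unsigned in2Len,
                 double *results)            /* in1Len + in2Len - 1 samples */
{
    unsigned n  = in1Len + in2Len - 1;       /* padded length */
    unsigned nc = n / 2 + 1;                 /* length of the r2c output */
    unsigned k;
    double re, im;
    double *a = fftw_alloc_real(n);
    double *b = fftw_alloc_real(n);
    fftw_complex *fa = fftw_alloc_complex(nc);
    fftw_complex *fb = fftw_alloc_complex(nc);
    fftw_plan pa, pb, pc;

    memset(a, 0, n * sizeof(double));        /* zero-pad both inputs */
    memset(b, 0, n * sizeof(double));
    memcpy(a, in1, in1Len * sizeof(double));
    memcpy(b, in2, in2Len * sizeof(double));

    pa = fftw_plan_dft_r2c_1d((int) n, a, fa, FFTW_ESTIMATE);
    pb = fftw_plan_dft_r2c_1d((int) n, b, fb, FFTW_ESTIMATE);
    fftw_execute(pa);
    fftw_execute(pb);

    /* Point-wise complex multiply; FFTW's c2r transform is unnormalized,
       so divide by n here. */
    for (k = 0; k < nc; k++) {
        re = fa[k][0] * fb[k][0] - fa[k][1] * fb[k][1];
        im = fa[k][0] * fb[k][1] + fa[k][1] * fb[k][0];
        fa[k][0] = re / n;
        fa[k][1] = im / n;
    }

    pc = fftw_plan_dft_c2r_1d((int) n, fa, a, FFTW_ESTIMATE);
    fftw_execute(pc);
    memcpy(results, a, n * sizeof(double));

    fftw_destroy_plan(pa);
    fftw_destroy_plan(pb);
    fftw_destroy_plan(pc);
    fftw_free(a);
    fftw_free(b);
    fftw_free(fa);
    fftw_free(fb);
}

The two forward transforms, the point-wise multiply, and the inverse transform together cost O(n log n), compared to O(n^2) for the nested loops.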
I know, but the problem I'm trying to solve specifically requires convolution in the time domain.
splicer
Yes, but Thomas's argument is that converting the input data from the time domain to the frequency domain, doing the convolution as an O(n) multiplication, and then converting back to the time domain is much more efficient than optimizing the O(n^2) convolution algorithm. You don't have to change the way you represent your data; just convert it as part of performing the convolution.
Ernelli
Taking the standard FFT approach is definitely more efficient, but it's only an approximation (albeit a very good one) of discrete time convolution. I need to convolve in the time domain because I'm going to be comparing the results to different results achieved with FFT (using various windowing functions and padding with different amounts of zeroes). Yes, this is for a school project ;)
splicer
But is the running time a big issue then? If you are comparing algorithms/windowing functions, do you have to work on large data sets? In that case, the best "parallelization" would be to run the comparisons on multiple processes/machines.
Ernelli
Yes, running time is an issue. I'm seeing if humans can tell the difference between the results of different audio convolutions. It's annoying having to wait half an hour between tests ;)
splicer
@splicer - I'm not sure how the FFT approach is "only an approximation" of convolution - aren't the two techniques absolutely equal? Unless you mean round-off error?
mtrw
@mtrw: FFT is an implementation of the DFT, not the DTFT. Essentially, the DFT (and FFT) sample the frequency domain. The Wikipedia article on *Spectral Leakage* explains the problem quite well: http://en.wikipedia.org/wiki/Spectral_leakage
splicer
@splicer - In correlation, you multiply the "leaked" spectra together and then transform back. The leakage is a faithful representation of the time domain waveforms in the other domain. The algorithm is zeropad the signals to newLen = 2*max(in1Len, in2Len), calculate the FFTs, multiply the spectra (complex multiply), inverse FFT, take the first in1Len+in2Len-1 samples as the answer. This is how Matlab/Octave implement their correlation functions internally.
mtrw
@mtrw: Can it be shown mathematically that this technique produces *exactly* the same result convolving in the time domain? I thought zero padding merely improved the accuracy of the approximation...
splicer
@Thomas: even though your answer wasn't what I was looking for, I marked it as my accepted answer. If someone sees my question in the future, it's important that they know FFT is the fastest way to perform convolution (even if it's only an approximation).
splicer
@splicer - It can be shown mathematically, but not by me :) Here's a line of reasoning though, see if you find this convincing: First, assume the two waveforms you're convolving are actually sections of two longer waveforms that are periodic pulses, with deadspace between pulses. Second, see that linearly convolving the pulses is the same as one period of circularly convolving their periodic "parents" because of the deadspace. Third, the DFTs of one period of each signal are exact (ripple and leakage and all), so their product and inverse transform are exact too. Hope that helps!
mtrw
A: 

Unless your arrays are very big, using a thread is unlikely to actually help much, since the overhead of starting a thread will be greater than the cost of the loops. However, let's assume that your arrays are large, and threading is a net win. In that case, I'd do the following:

  • Forget your current convolveHelper, which is too complicated and won't help much.
  • Split the interior of the loop into a thread function. I.e. just make

    for (j = 0; j < in2Len; j++) {
        results[i+j] += in1[i] * in2[j];
    }
    

into its own function that takes i as a parameter along with everything else.

  • Have the body of convolve simply launch threads. For maximum efficiency, use a semaphore to make sure that you never create more threads than you have cores.
JSBangs
Yes, my arrays are fairly large (256KB each). The problem with your approach is that the `+=` operation isn't thread-safe: there are many combinations of `i` and `j` which can produce a given `outPosition`.
splicer
A: 
CVS-2600Hertz
This will give completely different results. It does not perform convolution.
interjay
Sure, but you are still optimizing an O(n^2) algorithm when an O(n log n) solution exists. When n grows, O(n log n) always beats O(n^2). Maybe your data is not large enough to reach that crossover, and micro-optimizing the O(n^2) algorithm still improves the result.
Ernelli
Wait a minute... that doesn't work!
splicer
A: 

I finally figured out how to correctly precompute the start/end indexes (a suggestion given by both Tyler McHenry and interjay):

UInt32 i, start, end;

if (in1Len > in2Len) {
    if (outPosition < in2Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in1Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 1 + outPosition - in2Len;
        end = outPosition + 1;
    }
} else {
    if (outPosition < in1Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in2Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 0;
        end = in1Len;
    }
}

for (i = start; i < end; i++) {
    *result += in1[i] * in2[outPosition - i];
}
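
For what it's worth, the same bounds collapse to a clamp, start = max(0, outPosition - in2Len + 1) and end = min(in1Len, outPosition + 1); a sketch that should be equivalent to the branching above:

/* Equivalent, more compact form of the if/else ladder above. */
start = (outPosition >= in2Len) ? outPosition - in2Len + 1 : 0;
end   = (outPosition + 1 < in1Len) ? outPosition + 1 : in1Len;

for (i = start; i < end; i++) {
    *result += in1[i] * in2[outPosition - i];
}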

Unfortunately, precomputing the indexes produces no noticeable decrease in execution time :(

splicer