views:

65

answers:

1

I want to implement some image-processing algorithms which are intended to run on a beagleboard. These algorithms use convolutions extensively. I'm trying to find a good C implementation for 2D convolution (probably using the Fast Fourier Transform). I also want the algorithm to be able to run on the beagleboard's DSP, because I've heard that the DSP is optimized for these kinds of operations (with its multiply-accumulate instruction).

I have no background in the field so I think it won't be a good idea to implement the convolution myself (I probably won't do it as good as someone who understands all the math behind it). I believe a good C convolution implementation for DSP exists somewhere but I wasn't able find it?

Could someone help?

EDIT: Turns out the kernel is pretty small. Its dimensions are either 2X2 or 3X3. So I guess I'm not looking for an FFT-based implementation. I was searching for convolution on the web to see its definition so I can implement it in a straight forward way (I don't really know what convolution is). All I've found is something with multiplied integrals and I have no idea how to do it with matrices. Could somebody give me a piece of code (or pseudo code) for the 2X2 kernel case?

+2  A: 

What are the dimensions of the image and the kernel ? If the kernel is large then you can use FFT-based convolution, otherwise for small kernels just use direct convolution.

The DSP might not be the best way to do this though - just because it has a MAC instruction doesn't mean that it will be more efficient. Does the ARM CPU on the Beagle Board have NEON SIMD ? If so then that might be the way to go (and more fun too).

For a small kernel, you can do direct convolution like this:

// in, out are m x n images (integer data)
// K is the kernel size (KxK) - currently needs to be an odd number, e.g. 3
// coeffs[K][K] is a 2D array of integer coefficients
// scale is a scaling factor to normalise the filter gain

for (i = K / 2; i < m - K / 2; ++i) // iterate through image
{
  for (j = K / 2; j < n - K / 2; ++j)
  {
    int sum = 0; // sum will be the sum of input data * coeff terms

    for (ii = - K / 2; ii <= K / 2; ++ii) // iterate over kernel
    {
      for (jj = - K / 2; jj <= K / 2; ++jj)
      {
        int data = in[i + ii][j +jj];
        int coeff = coeffs[ii + K / 2][jj + K / 2];

        sum += data * coeff;
      }
    }
    out[i][j] = sum / scale; // scale sum of convolution products and store in output
  }
}

You can modify this to support even values of K - it just takes a little care with the upper/lower limits on the two inner loops.

Paul R
@Paul: I'd imagine that for straight-up convolutions, the TI would beat the ARM, given modulo addressing modes and zero-overhead loops and so on.
Oli Charlesworth
@Oli: you may be right - I'm not sure that modulo addressing helps, but there may be other advantages.
Paul R
@MPaul: Modulo addressing helps if you're zipping through a circular buffer. I have no numbers to back this up, but if the TI can't beat the host processor at the canonical thing it's designed for, then it's a pretty pointless combination!
Oli Charlesworth
@Oli: it would certainly be interesting to try a routine such as this on both the ARM and the DSP and see which comes out faster. One advantage of doing it on the DSP of course is that it leaves the ARM free for other tasks. I guess the pragmatic solution though would be to see which, if any, has existing library code for this that you might be able to leverage for this task and then go with that. Most DSPs have 1D and 2D signal processing libraries so that would probably be the first thing to look at.
Paul R
@Paul: Good point with "doing it on the DSP leaves the ARM free".
Oli Charlesworth
@Paul: Thanks for the things you said. Do you know where can I find these DSP signal processing libraries you mentioned?
snakile
@Paul: I can't give an exact answer for the dimensions question. Th algorithms I plan to implement come from an academic paper which doesn't say anything about the image dimentions. But I guess I'll want it to operate on 512X512 images.
snakile
@snakikle: What's more important is the kernel, i.e. the thing you'll be convolving the image with. As Paul says, if this is "small", then it's more efficient to use standard convolution.
Oli Charlesworth
@snakile: I believe the DSP on the Beagleboard is from the TI C64X family, so you might want to look at the TI web site or any other TI DSP-related sites to see what ready-rolled routines are available. You'll need a C64X toolchain to build this code of course (see http://elinux.org/BeagleBoard/DSP_Howto), and it's probably going to be a lot harder to test/debug than regular ARM CPU code, but that's half the fun of course.
Paul R
@Oli, @Paul: Thank you both. Your remarks were helpful. Now it turns out the kernel is small, so as Paul said it doesn't make sense to use FFT. I edited my question. Could you please help me a bit more?
snakile
@snakile: no problem - I've edited my answer to show how to implement a convolution with a KxK kernel with integer data and coefficients.
Paul R