I'm working on an old-school "image warp" filter. Essentially, I have a 2D array of pixels (ignoring for the present the question of whether they are color, grayscale, float, RGBA, etc.) and another 2D array of vectors (with floating-point components), with the image being at least as large as the vector array. In pseudo code, I want to do this:

FOR EACH PIXEL (x,y)      
  vec = vectors[x,y]                    // Get vector
  val = get(img, x + vec.x, y + vec.y)  // Get input at <x,y> + vec
  output[x,y] = val                     // Write to output

The catch is that get() needs to bilinearly sample the input image, because the vectors can refer to subpixel coordinates. But unlike bilinear sampling in, say, texture mapping where we can work the interpolation math into a loop so it's all just adds, here the reads are from random locations. So get()'s definition looks something like this:

FUNCTION get(in,x,y)
  ix = floor(x); iy = floor(y)      // Integer upper-left coordinates
  xf = x - ix;   yf = y - iy        // Fractional parts

  a = in[ix,iy];   b = in[ix+1,iy]    // Four bordering pixel values
  c = in[ix,iy+1]; d = in[ix+1,iy+1]

  ab = lerp(a,b,xf)                  // Interpolate
  cd = lerp(c,d,xf)
  RETURN lerp(ab,cd,yf)

and lerp() is simply

FUNCTION lerp(a,b,x) 
  RETURN (1-x)*a + x*b
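
For concreteness, here is a minimal runnable C sketch of the above, assuming a grayscale float image in row-major order and clamp-to-edge border handling (the names Image, Vec2, pixel, and warp are mine, not fixed):

#include <math.h>

typedef struct { float x, y; } Vec2;
typedef struct { int w, h; const float *px; } Image;

static float lerp(float a, float b, float t) {
    return (1.0f - t) * a + t * b;
}

static float pixel(const Image *im, int x, int y) {
    /* Clamp to the border so out-of-range vectors are safe. */
    if (x < 0) x = 0; else if (x >= im->w) x = im->w - 1;
    if (y < 0) y = 0; else if (y >= im->h) y = im->h - 1;
    return im->px[y * im->w + x];
}

static float get(const Image *im, float x, float y) {
    int ix = (int)floorf(x), iy = (int)floorf(y);   /* integer upper-left */
    float xf = x - ix, yf = y - iy;                 /* fractional parts   */

    float a = pixel(im, ix,     iy);                /* four bordering pixels */
    float b = pixel(im, ix + 1, iy);
    float c = pixel(im, ix,     iy + 1);
    float d = pixel(im, ix + 1, iy + 1);

    return lerp(lerp(a, b, xf), lerp(c, d, xf), yf);
}

void warp(const Image *in, const Vec2 *vectors, float *out, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            Vec2 v = vectors[y * w + x];
            out[y * w + x] = get(in, x + v.x, y + v.y);
        }
}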

Assuming that neither the input image nor the vector array is known in advance, what kind of high-level optimizations are possible? (Note: "Use the GPU" is cheating.) One thing I can think of is rearranging the interpolation math in get() so that we can cache the pixel reads and intermediate calculations for a given (ix,iy). That way, if successive accesses are to the same subpixel quad, we can avoid some work. If the vector array is known in advance, then we can rearrange it so that the coordinates passed to get() tend to be more local. This might help with cache locality as well, but at the expense of having the writes to output be all over the place. But then we can't do fancy things like scale the vectors on the fly, or even move the warp effect from its original precalculated location.
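
For example, a single-entry cache of the last quad might look like this (a sketch reusing the Image/pixel/lerp helpers from the C sketch above; only worthwhile if consecutive vectors really do land in the same quad):

#include <limits.h>
#include <math.h>

/* Sketch: remember the last quad so repeated hits on the same
   (ix, iy) skip the four loads. Static state is for illustration
   only; a real version would pass a small context struct instead. */
static int   cx = INT_MIN, cy = INT_MIN;   /* cached quad coords  */
static float ca, cb, cc, cd;               /* cached pixel values */

static float get_cached(const Image *im, float x, float y) {
    int ix = (int)floorf(x), iy = (int)floorf(y);
    float xf = x - ix, yf = y - iy;

    if (ix != cx || iy != cy) {            /* quad changed: reload */
        cx = ix; cy = iy;
        ca = pixel(im, ix,     iy);
        cb = pixel(im, ix + 1, iy);
        cc = pixel(im, ix,     iy + 1);
        cd = pixel(im, ix + 1, iy + 1);
    }
    return lerp(lerp(ca, cb, xf), lerp(cc, cd, xf), yf);
}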

The only other possibility I can think of would be to use fixed-point vector components, perhaps with very limited fractional parts. E.g., if the vectors only have 2-bit fractional components, that gives four subpixel positions per axis, so there are only 16 subpixel regions that could be accessed. We could precompute the weights for these and avoid much of the interpolation math altogether, but with a hit to quality.
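
A sketch of that weight table, again assuming the Image/pixel helpers above and a fixed-point coordinate format with the 2-bit fraction in the low bits (both are my assumptions):

static float weights[4][4][4];            /* [yf][xf][a,b,c,d] */

static void init_weights(void) {
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i) {
            float xf = i / 4.0f, yf = j / 4.0f;
            weights[j][i][0] = (1 - xf) * (1 - yf);   /* a */
            weights[j][i][1] =       xf * (1 - yf);   /* b */
            weights[j][i][2] = (1 - xf) *       yf;   /* c */
            weights[j][i][3] =       xf *       yf;   /* d */
        }
}

static float get_fixed(const Image *im, int fx, int fy) {
    /* fx, fy are fixed point: integer part in the high bits,
       2-bit fraction in the low two bits. */
    int ix = fx >> 2, iy = fy >> 2;
    const float *w = weights[fy & 3][fx & 3];
    return w[0] * pixel(im, ix,     iy)
         + w[1] * pixel(im, ix + 1, iy)
         + w[2] * pixel(im, ix,     iy + 1)
         + w[3] * pixel(im, ix + 1, iy + 1);
}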

Any other ideas? I want to accumulate a few different methods before I implement them and see which is the best. If someone could point me to source code of a fast implementation, that would be great.

+2  A: 

Interesting problem.

Your problem definition essentially forces unpredictable access to in[x,y], since any vector might be provided. Assuming the vector array tends to refer to local pixels, the very first optimisation would be to ensure that you traverse memory in a suitable order to make the most of cache locality. This might mean scanning the "for each pixel" loop in 32*32 blocks, so that the reads of in[x,y] hit the same cache lines as often as possible within a short window.
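
A sketch of that blocked traversal, reusing the types and get() from the question's C sketch (the tile size is a tuning parameter, not a magic number):

#define TILE 32

void warp_tiled(const Image *in, const Vec2 *vectors,
                float *out, int w, int h) {
    for (int ty = 0; ty < h; ty += TILE)
        for (int tx = 0; tx < w; tx += TILE) {
            int ymax = ty + TILE < h ? ty + TILE : h;
            int xmax = tx + TILE < w ? tx + TILE : w;
            /* If the vectors are local, all quads touched inside one
               tile come from roughly the same region of the input. */
            for (int y = ty; y < ymax; ++y)
                for (int x = tx; x < xmax; ++x) {
                    Vec2 v = vectors[y * w + x];
                    out[y * w + x] = get(in, x + v.x, y + v.y);
                }
        }
}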

Most likely, the performance of your algorithm is going to be bound by two things:

  1. How fast you can load vectors[x,y] and in[x,y] from main memory
  2. How long it takes to do the multiplications and sums

There are SSE instructions that can multiply several elements together at a time and then add them together (multiply and accumulate). What you should do is compute

af = (1 - xf) * ( 1 - yf )
bf = (    xf) * ( 1 - yf )
cf = (1 - xf) * (     yf )
df = (    xf) * (     yf )

and then compute

a *= af
b *= bf
c *= cf
d *= df
return (a + b + c + d)

There is a good chance that both of these steps could be accomplished with a surprisingly small number of SSE instructions (depending on your pixel representation).
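
A sketch of that weighted sum with SSE intrinsics, assuming float pixels already loaded into a, b, c, d (a real version would also vectorize the loads, which depends on your pixel representation):

#include <xmmintrin.h>   /* SSE */

static float bilinear_sse(float a, float b, float c, float d,
                          float xf, float yf) {
    __m128 px = _mm_set_ps(d, c, b, a);                     /* [a b c d]    */
    __m128 wx = _mm_set_ps(xf, 1.0f - xf, xf, 1.0f - xf);
    __m128 wy = _mm_set_ps(yf, yf, 1.0f - yf, 1.0f - yf);
    __m128 w  = _mm_mul_ps(wx, wy);                         /* af bf cf df  */
    __m128 s  = _mm_mul_ps(px, w);                          /* a*af .. d*df */

    /* Horizontal sum of the four products. */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    return _mm_cvtss_f32(s);
}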

I think that caching intermediate values is very unlikely to be useful - it seems extremely unlikely that more than 1% of the vector requests will point to exactly the same place, and the caching will cost you more in memory bandwidth than it saves.

If you use your CPU's prefetch instructions to prefetch in[vectors[x+1, y]] as you process vectors[x,y], you might improve memory performance; otherwise there is no way the CPU can predict your random walk around memory.
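
A sketch of that, again on top of the question's C version (the lookahead distance of one pixel is a guess you would tune):

#include <xmmintrin.h>   /* _mm_prefetch */

static void warp_row_prefetch(const Image *in, const Vec2 *vectors,
                              float *out, int w, int y) {
    for (int x = 0; x < w; ++x) {
        if (x + 1 < w) {
            /* Hint the cache about the quad the next vector will hit.
               Prefetch never faults, so we don't bother clamping here. */
            Vec2 nv = vectors[y * w + x + 1];
            int nx = (int)(x + 1 + nv.x), ny = (int)(y + nv.y);
            _mm_prefetch((const char *)&in->px[ny * in->w + nx], _MM_HINT_T0);
        }
        Vec2 v = vectors[y * w + x];
        out[y * w + x] = get(in, x + v.x, y + v.y);
    }
}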

The final way to improve the performance of your algorithm is to handle chunks of input pixels at once, e.g. x[0..3], x[4..7] - this lets you unroll the inner maths loops. However, you are so likely to be memory bound that this won't help much.

Tom Leys