I'm working on an old-school "image warp" filter. Essentially, I have a 2D array of pixels (ignoring for the present the question of whether they are color, grayscale, float, RGBA, etc.) and another 2D array of vectors (with floating-point components), with the image being at least as large as the vector array. In pseudo code, I want to do this:
FOR EACH PIXEL (x,y)
vec = vectors[x,y] // Get vector
val = get(img, x + vec.x, y + vec.y) // Get input at <x,y> + vec
output[x,y] = val // Write to output
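To make the memory layout concrete, here's the driver loop in C for a grayscale float image stored row-major (all the names are placeholders of mine; bilinear_get() is the sampler sketched further down):

typedef struct { float x, y; } Vec2;

/* The subpixel sampler, defined further down. */
float bilinear_get(const float *img, int w, int h, float x, float y);

/* img and out are w*h grayscale floats, row-major; vecs is a w*h array
   of per-pixel displacement vectors. */
void warp(const float *img, const Vec2 *vecs, float *out, int w, int h)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            Vec2 v = vecs[y * w + x];
            out[y * w + x] = bilinear_get(img, w, h, x + v.x, y + v.y);
        }
    }
}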
The catch is that get() needs to bilinearly sample the input image, because the vectors can refer to subpixel coordinates. But unlike bilinear sampling in, say, texture mapping, where we can work the interpolation math into a loop so it's all just adds, here the reads are from random locations. So get()'s definition looks something like this:
FUNCTION get(in,x,y)
ix = floor(x); iy = floor(y) // Integer upper-left coordinates
xf = x - ix; yf = y - iy // Fractional parts
a = in[ix,iy];   b = in[ix+1,iy]   // Four bordering pixel values
c = in[ix,iy+1]; d = in[ix+1,iy+1]
ab = lerp(a,b,xf) // Interpolate
cd = lerp(c,d,xf)
RETURN lerp(ab,cd,yf)
and lerp() is simply
FUNCTION lerp(a,b,x)
RETURN (1-x)*a + x*b
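In C that sampler might look like this (again grayscale floats, row-major; the clamp-to-edge border handling is my own assumption, since the pseudocode above doesn't specify what happens at the edges):

#include <math.h>

static inline float lerp(float a, float b, float t)
{
    return (1.0f - t) * a + t * b;
}

static inline int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Bilinear sample at subpixel (x,y), clamping to the edge at the borders. */
float bilinear_get(const float *img, int w, int h, float x, float y)
{
    int ix = (int)floorf(x), iy = (int)floorf(y);
    float xf = x - ix, yf = y - iy;

    int x0 = clampi(ix,     0, w - 1), x1 = clampi(ix + 1, 0, w - 1);
    int y0 = clampi(iy,     0, h - 1), y1 = clampi(iy + 1, 0, h - 1);

    float a = img[y0 * w + x0], b = img[y0 * w + x1];   /* top row    */
    float c = img[y1 * w + x0], d = img[y1 * w + x1];   /* bottom row */

    return lerp(lerp(a, b, xf), lerp(c, d, xf), yf);
}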
Assuming that neither the input image nor the vector array is known in advance, what kind of high-level optimizations are possible? (Note: "Use the GPU" is cheating.) One thing I can think of is rearranging the interpolation math in get() so that we can cache the pixel reads and intermediate calculations for a given (ix,iy). That way, if successive accesses fall in the same 2x2 pixel quad, we can avoid some of the work.
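Something like this is what I have in mind (untested sketch; it relies on the algebraically equivalent form a + xf*(b-a) + yf*(c-a) + xf*yf*(a-b-c+d), and border handling is left out):

#include <math.h>

/* The four reads and the three differences depend only on (ix,iy), so
   they can be reused while consecutive lookups stay in the same quad.
   Assumes 0 <= x < w-1 and 0 <= y < h-1; initialize ix,iy to -1. */
typedef struct {
    int   ix, iy;            /* quad currently cached                */
    float a, dx, dy, dxy;    /* a, b-a, c-a, a-b-c+d for that quad   */
} QuadCache;

float bilinear_get_cached(const float *img, int w, float x, float y,
                          QuadCache *qc)
{
    int ix = (int)floorf(x), iy = (int)floorf(y);
    float xf = x - ix, yf = y - iy;

    if (ix != qc->ix || iy != qc->iy) {          /* miss: refill cache */
        const float *p = img + iy * w + ix;
        float a = p[0], b = p[1], c = p[w], d = p[w + 1];
        qc->ix = ix;      qc->iy  = iy;
        qc->a  = a;       qc->dx  = b - a;
        qc->dy = c - a;   qc->dxy = a - b - c + d;
    }
    return qc->a + xf * qc->dx + yf * qc->dy + xf * yf * qc->dxy;
}

Whether this wins depends on how often neighboring lookups land in the same quad, which for a smooth warp field should be reasonably often.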
If the vector array is known in advance, then we can rearrange it so that the coordinates passed to get() tend to be more local. This might help with cache locality as well, but at the expense of having the writes to output be all over the place. But then we can't do fancy things like scale the vectors on the fly, or even move the warp effect from its original precalculated location.
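One crude way to realize that reordering, assuming I precompute one work item per output pixel and sort by a coarse tile of the source coordinate (TILE and the struct layout are arbitrary choices of mine; a space-filling-curve order would be another option; source coordinates assumed non-negative):

#include <stdlib.h>

/* One work item per output pixel: where to write, where to sample.
   Sorting by source tile makes the reads coherent but scatters writes. */
typedef struct { int out_idx; float sx, sy; } WarpItem;

#define TILE 32

static int by_source_tile(const void *pa, const void *pb)
{
    const WarpItem *a = pa, *b = pb;
    int tya = (int)a->sy / TILE, tyb = (int)b->sy / TILE;
    if (tya != tyb) return tya - tyb;
    return (int)a->sx / TILE - (int)b->sx / TILE;
}

float bilinear_get(const float *img, int w, int h, float x, float y);

void warp_reordered(const float *img, float *out, int w, int h,
                    WarpItem *items, size_t n)
{
    /* The qsort really belongs in a one-time prep pass (the vector
       array is fixed); it's inlined here only to keep the sketch short. */
    qsort(items, n, sizeof *items, by_source_tile);
    for (size_t i = 0; i < n; ++i)
        out[items[i].out_idx] =
            bilinear_get(img, w, h, items[i].sx, items[i].sy);
}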
The only other possibility would be to use fixed-point vector components, perhaps with very limited fractional parts. E.g., if the vectors only have 2-bit fractional components, then there are only 16 subpixel regions that could be accessed. We could precompute the weights for these, and avoid much of the interpolation math altogether, but with a hit to quality.
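A sketch of that weight-table idea with 2 fractional bits, so a 4x4 table of precomputed weight quadruples (border handling omitted again, and the names are mine):

#define FRAC_BITS  2
#define FRAC_STEPS (1 << FRAC_BITS)     /* 4 fractional positions per axis */

/* 4*4 = 16 precomputed weight quadruples, one per (xf,yf) subpixel cell. */
static float wtab[FRAC_STEPS][FRAC_STEPS][4];

void init_weight_table(void)
{
    for (int fy = 0; fy < FRAC_STEPS; ++fy)
        for (int fx = 0; fx < FRAC_STEPS; ++fx) {
            float xf = fx / (float)FRAC_STEPS, yf = fy / (float)FRAC_STEPS;
            wtab[fy][fx][0] = (1 - xf) * (1 - yf);   /* weight of a */
            wtab[fy][fx][1] = xf       * (1 - yf);   /* weight of b */
            wtab[fy][fx][2] = (1 - xf) * yf;         /* weight of c */
            wtab[fy][fx][3] = xf       * yf;         /* weight of d */
        }
}

/* x_fp, y_fp are fixed-point coordinates with FRAC_BITS fractional bits
   (i.e. pixel coordinate * 4). */
float get_fixed(const float *img, int w, int x_fp, int y_fp)
{
    int ix = x_fp >> FRAC_BITS,        iy = y_fp >> FRAC_BITS;
    int fx = x_fp & (FRAC_STEPS - 1),  fy = y_fp & (FRAC_STEPS - 1);
    const float *wgt = wtab[fy][fx];
    const float *p   = img + iy * w + ix;
    return wgt[0] * p[0] + wgt[1] * p[1] + wgt[2] * p[w] + wgt[3] * p[w + 1];
}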
Any other ideas? I want to accumulate a few different methods before I implement them and see which is the best. If someone could point me to source code of a fast implementation, that would be great.