A: 

One potential speedup might be to use binary operators. For example, you could march through A XOR B for successive, overlapping regions of A. The region whose XOR result is closest to 0 would be the portion of A that most closely resembles B. If you had to take the alpha mask into account, assume A's alpha mask is all 1s and include it in the XOR: 32 bits per pixel instead of 24.
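
A minimal sketch of that idea, assuming both images are packed as 32-bit RGBA words (with A's alpha forced to all 1s) and using the bit count of the XOR as the difference score; the function name and pointer layout here are illustrative, not from the question's code:

#include <cstdint>
#include <bit>      // std::popcount (C++20)

// Score one candidate placement of B over A; lower score = closer match.
uint64_t xor_score( const uint32_t* a, int a_stride,
                    const uint32_t* b, int b_w, int b_h )
{
    uint64_t score = 0;
    for( int y = 0; y < b_h; ++y )
        for( int x = 0; x < b_w; ++x )
            // identical pixels XOR to 0 and contribute nothing
            score += std::popcount( a[y * a_stride + x] ^ b[y * b_w + x] );
    return score;
}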

fbrereto
Hmm. I'll think about it, but frankly I was looking for a different/better search algorithm. Plus, the fastest implementation of that thing I have uses OpenGL GLSL shaders, and they operate on floats internally (no XOR).
SigTerm
@SigTerm: You said both images are 24-bit RGB - I may be missing something, but how does having those values stored in floating point make for better performance?
fbrereto
@fbrereto: I was talking about a GPU (video card) accelerated implementation that uses shaders for the brute-force search. From within shader code you "see" any color as a 4-component vector of floats (RGBA), and you can't perform XOR operations on them. You may get better performance because the GPU is optimized for certain types of calculations and has faster texture memory access. Unfortunately, brute-force search doesn't benefit from them, because random texture memory access very quickly brings performance down.
SigTerm
@fbrereto: Also, it looks like I was wrong and the GPU/OpenGL implementation works slower on my machine than the single-threaded CPU version. This probably happens because it isn't a top-notch gaming video card and because the implementation wasn't efficient (using shaders instead of CUDA wasn't a good idea).
SigTerm
A: 

I'd consider moving your early out on curDifference into your inner loop so that it can short-circuit well before the inner loop is done if the error is already too great. You're trading ifs for some heavy math. Also, the pixel-scale value on the error could be a multiply instead of a divide (minor on new machines).
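
A rough sketch of both ideas; threshold, num_pixels and diff() are illustrative names, not from the question's code:

int diff( int y, int x );   // assumed: per-pixel absolute difference

bool candidate_viable( int threshold, int num_pixels, int h, int w )
{
    // multiply once instead of dividing per candidate:
    // test sum < threshold * num_pixels rather than sum / num_pixels < threshold
    const int limit = threshold * num_pixels;

    int sum = 0;
    for( int y = 0; y < h && sum < limit; ++y )   // early out inside the loop
        for( int x = 0; x < w; ++x )
            sum += diff( y, x );

    return sum < limit;
}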

Any chance of reading multiple pixels at a time or parallelizing their processing?

For threading, you could start a thread per outer for-loop iteration (breaking the work up into however many threads you want to use) to let your CPUs be more efficient. Syncing the max error would be the only issue - which could be handled by storing the errors into an external table and comparing at the end, to prevent memory contention.
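
A hedged sketch of that scheme in C++; search_rows() stands in for your existing scan loop, and the names are illustrative:

#include <climits>   // INT_MAX
#include <thread>
#include <vector>

struct Best { int error = INT_MAX; int x = 0, y = 0; };

Best search_rows( int begin, int end );   // assumed: wraps the existing scan

Best parallel_search( int height, int num_threads )
{
    // external table: one slot per thread, so no locking is needed
    std::vector<Best> results( num_threads );
    std::vector<std::thread> pool;

    for( int t = 0; t < num_threads; ++t )
        pool.emplace_back( [&results, t, height, num_threads]
        {
            int begin = t * height / num_threads;
            int end   = ( t + 1 ) * height / num_threads;
            results[t] = search_rows( begin, end );
        } );

    for( auto& th : pool )
        th.join();

    // compare the per-thread results at the end to find the global best
    Best best;
    for( const auto& r : results )
        if( r.error < best.error )
            best = r;
    return best;
}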

Caching your structure members in locals to get rid of the ->'s can help, but the compiler usually does this for you.

Just some thoughts to begin with. Still looking...

Michael Dorgan
+1  A: 

You may try to find an approximate solution: PatchMatch

This paper presents interactive image editing tools using a new randomized algorithm for quickly finding approximate nearest neighbor matches between image patches. Previous research in graphics and vision has leveraged such nearest-neighbor searches to provide a variety of high-level digital image editing tools. However, the cost of computing a field of such matches for an entire image has eluded previous efforts to provide interactive performance. Our algorithm offers substantial performance improvements over the previous state of the art (20-100x), enabling its use in interactive editing tools.

Ross
Interesting, but I've found another solution.
SigTerm
+2  A: 

You'll want to look at motion estimation, which is used in video coding to find the location of a block in a previously coded picture that most closely resembles the block to be coded.

(NOTE: I don't have enough reputation to post 2 links, so you'll have to look up motion estimation on Wikipedia.)

Some simple block-matching algorithms can be found here. These work by analyzing only a subset of points in the search area.

If you want to find the specific point that minimizes your matching function, you'll have to do a full search. Full-search speedup is usually achieved by early termination: ending the evaluation of a point as soon as it can no longer improve upon the previous best result.

min_sad   = INT_MAX   // minimum sum of absolute differences found so far
min_point = {0, 0}

foreach( point p : all_search_points )
{
    sad = 0
    // early termination: stop accumulating as soon as this point
    // can no longer beat the best result found so far
    for( y = 0; y < block_height && sad < min_sad; ++y )
        for( x = 0; x < block_width; ++x )
            sad += abs( block_b[y, x] - block_a[p.y + y, p.x + x] )

    if( sad < min_sad )
    {
        min_sad   = sad
        min_point = p
    }
}

Early termination is also useful when examining only a subset of search points, though the speedup isn't as great as with a full search.

ganz
Sounds promising, I'll check it out. Terminating the search early is also a good idea.
SigTerm
A: 

PDiff is an open-source perceptual image difference tool that may have some helpful techniques for you.

Marcus Lindblom
+1  A: 

Answering my own question.

Short answer: I was able to drop the alpha channel, so I've decided to use image pyramids (see "image pyramid" and "Gaussian pyramid" on the net). It gave a huge speed improvement.

Long answer:

My initial goal was texture synthesis. Alpha was used while generating pixels that weren't filled yet, and B represented a portion of the already generated image. (I.e., A was the sample pattern, and B was the generated image.)

After a bit of research I've found that there is no quick way to do a search in N-dimensional space (for example, a 3x3 pixel area is basically a 24-component vector (center pixel excluded), while a 7x7 area will be 144-component; searching for such an area means a 24-dimensional or 144-dimensional search). Well, there are ways (for example, the paper named "I-COLLIDE: an interactive and exact collision detection system for large-scale environments" uses 3 sorted arrays (each sorted on a different dimension) to do a 3-dimensional search), but they obviously work better for floats and a lower number of dimensions.

The suggestion to use motion estimation wasn't useful, because (it seems) motion estimation assumes that pixels represent moving objects (not true in my case), and at least some of its optimizations rely on that.

In the end I've found a paper named "Fast Texture Synthesis using Tree-structured Vector Quantization" (Li-Yi Wei, Marc Levoy, Stanford University), which uses a technique based on an algorithm similar to the one I used. The image to be searched is downsampled several times (similar to mip-map generation), the search is performed on the lowest-resolution level first, and the result is then refined on the next level up. It may not be the best way to do actual image search for other applications, but it is perfect for my purposes. The paper is relatively old, but it works for me.
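
For illustration, a minimal coarse-to-fine sketch of that idea (not the paper's exact algorithm); Image, downsample(), full_search() and local_search() are placeholders for your own image type and the existing brute-force scan:

#include <vector>

struct Image { /* pixel data; placeholder for your image type */ };
struct Point { int x, y; };

Image downsample( const Image& src );                      // e.g. 2x2 box filter
Point full_search( const Image& A, const Image& B );       // exhaustive scan
Point local_search( const Image& A, const Image& B,
                    Point seed, int radius );              // small-window scan

Point pyramid_search( const Image& A, const Image& B, int levels )
{
    // level 0 = full resolution; each level halves width and height
    std::vector<Image> pa, pb;
    pa.push_back( A );
    pb.push_back( B );
    for( int i = 1; i < levels; ++i )
    {
        pa.push_back( downsample( pa[i - 1] ) );
        pb.push_back( downsample( pb[i - 1] ) );
    }

    // exhaustive search only at the coarsest level, where it is cheap
    Point best = full_search( pa[levels - 1], pb[levels - 1] );

    // refine: scale the match up by 2 at each finer level and search
    // only a small window around it
    for( int i = levels - 2; i >= 0; --i )
    {
        Point seed = { best.x * 2, best.y * 2 };
        best = local_search( pa[i], pb[i], seed, /*radius=*/1 );
    }
    return best;
}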

The same paper mentions a few techniques for accelerating the search even further. One of them is tree-structured vector quantization (TSVQ), although I can't give more info about it (I haven't checked it - the current texture generator works with acceptable speed on my hardware, so I probably won't look into further optimizations).

SigTerm