One potential speedup might be to use binary operators. For example, you could march through A XOR B
for subsequent, overlapping regions of A. The region whose XOR result is closest to 0 would be the portion of A that most closely resembles B. If you have to take the alpha mask into account, assume A's alpha mask is all 1s and include it in the XOR: 32 bits per pixel instead of 24.
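A minimal sketch of the XOR idea, assuming 32-bit RGBA pixels in plain arrays (the function and parameter names are made up for illustration). Note that the raw XOR value weights high bits more than low ones, so "closest to 0" is only a rough similarity measure:

```cpp
#include <cstddef>
#include <cstdint>

// XOR each 32-bit pixel of a candidate region of A against B and sum the
// results. Identical regions give 0; the candidate with the smallest sum
// is (roughly) the closest match.
uint64_t xorDifference(const uint32_t* regionA, const uint32_t* b, size_t n)
{
    uint64_t diff = 0;
    for (size_t i = 0; i < n; ++i)
        diff += static_cast<uint64_t>(regionA[i] ^ b[i]); // 0 when pixels match exactly
    return diff;
}
```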
I'd consider moving your early-out on curDifference into your inner loop, so that it can short-circuit well before the inner loop is done if the error is already too great. You're trading ifs for some heavy math. Also, the pixel-scale factor on the error could be a multiply instead of a divide (minor on modern machines).
Any chance of reading multiple pixels at a time or paralleling their processing?
For threading, you could split the outer for loop's iterations across however many threads you want to use, to keep your CPUs busy. Syncing the max error would be the only issue - which could be handled by storing each thread's best error in an external table and comparing them at the end, to prevent memory contention.
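A sketch of that threading scheme: each thread gets an interleaved slice of the outer loop and its own slot in an external table, so no locking is needed until the final comparison. `costOf` is a placeholder for whatever per-position error the real code computes; all names here are assumptions:

```cpp
#include <climits>
#include <thread>
#include <vector>

struct Best { long error; int pos; };

// Placeholder error function: minimized at pos == 7.
long costOf(int pos) { return (long)(pos - 7) * (pos - 7); }

Best parallelSearch(int positions, int numThreads)
{
    // One slot per thread - each thread writes only its own entry,
    // so there is no contention during the search.
    std::vector<Best> table(numThreads, Best{LONG_MAX, -1});
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back([&, t] {
            for (int p = t; p < positions; p += numThreads) { // interleaved split
                long e = costOf(p);
                if (e < table[t].error) table[t] = {e, p};
            }
        });
    for (auto& th : threads) th.join();

    // Compare the per-thread results at the end.
    Best best = table[0];
    for (int t = 1; t < numThreads; ++t)
        if (table[t].error < best.error) best = table[t];
    return best;
}
```

One trade-off: a per-thread table gives up the shared "best so far" bound that makes early termination effective, so in practice you might periodically read a shared (atomic) minimum instead.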
Caching your structure fields in locals to get rid of the ->'s can help, but the compiler usually does this for you.
Just some thoughts to begin with. Still looking...
You may try to find approximate solution: Patch Match
This paper presents interactive image editing tools using a new randomized algorithm for quickly finding approximate nearest neighbor matches between image patches. Previous research in graphics and vision has leveraged such nearest-neighbor searches to provide a variety of high-level digital image editing tools. However, the cost of computing a field of such matches for an entire image has eluded previous efforts to provide interactive performance. Our algorithm offers substantial performance improvements over the previous state of the art (20-100x), enabling its use in interactive editing tools.
You'll want to look at motion estimation, which is used in video coding to find the location of a block in a previously coded picture that most closely resembles the block to be coded.
(NOTE: I don't have enough reputation to post 2 links, so you'll have to look up motion estimation in wikipedia).
Some simple block matching algorithms can be found here. These work by only analyzing a subset of points in the search area.
If you want to find the specific point that minimizes your matching function, you'll have to do a full search. Full-search speedup is usually achieved by early termination - ending the evaluation of a point if it is already impossible for it to improve upon the previous best result.
min_sad = INT_MAX  // minimum sum of absolute differences
min_point = {0, 0}
foreach( point p : all_search_points )
{
    sad = 0
    for( y = 0; y < block_height; ++y )
        for( x = 0; x < block_width && sad < min_sad; ++x )
            sad += abs( block_b[y,x] - block_a[p.y+y, p.x+x] )
    if( sad < min_sad )
    {
        min_sad = sad
        min_point = p
    }
}
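The same search, written out as runnable C++ for 8-bit grayscale images stored row-major (the RGB case just sums over channels as well). The `Point`/`fullSearch` names and image layout are illustrative, not from the question's code:

```cpp
#include <climits>
#include <cstdlib>
#include <vector>

struct Point { int x, y; };

// SAD of block b placed at position p in image a, with early termination:
// stops accumulating once the sum can no longer beat minSad.
int sadAt(const std::vector<unsigned char>& a, int aw,
          const std::vector<unsigned char>& b, int bw, int bh,
          Point p, int minSad)
{
    int sad = 0;
    for (int y = 0; y < bh; ++y)
        for (int x = 0; x < bw && sad < minSad; ++x)
            sad += std::abs(b[y * bw + x] - a[(p.y + y) * aw + (p.x + x)]);
    return sad;
}

// Full search: every position where b fits inside a is a search point.
Point fullSearch(const std::vector<unsigned char>& a, int aw, int ah,
                 const std::vector<unsigned char>& b, int bw, int bh)
{
    int minSad = INT_MAX;
    Point best = {0, 0};
    for (int py = 0; py + bh <= ah; ++py)
        for (int px = 0; px + bw <= aw; ++px) {
            int sad = sadAt(a, aw, b, bw, bh, {px, py}, minSad);
            if (sad < minSad) { minSad = sad; best = {px, py}; }
        }
    return best;
}
```

An early-terminated `sadAt` may return a value larger than the true SAD for that point, but that is safe: it only happens when the point has already lost to the current best.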
Early termination is also useful when examining only a subset of search points, though the speedup isn't as great as with full search.
PDiff is an open source perceptual image difference tool that may have some helpful techniques for you.
Answering my own question.
Short answer: I was able to drop the alpha channel, so I decided to use image pyramids (look up image pyramids and Gaussian pyramids). It gave a huge speed improvement.
Long answer:
My initial goal was texture synthesis. Alpha was used while generating pixels that weren't filled yet, and B represented a portion of the already-generated image. (I.e., A was the sample pattern, and B was the generated image.)
After a bit of research I found that there is no quick way to do a search in N-dimensional space (for example, a 3x3 pixel area is basically a 24-component vector (center pixel excluded), while a 7x7 area will be 144-component, so searching for such an area is a 24-dimensional or 144-dimensional search). Well, there are ways (for example, the paper "I-COLLIDE: An Interactive and Exact Collision Detection System for Large-Scale Environments" uses 3 sorted arrays (each sorted on a different dimension) to do a 3-dimensional search), but they will obviously work better with floats and a lower number of dimensions.
The suggestion to use motion estimation wasn't useful, because (it seems) motion estimation assumes that pixels represent moving objects (not true in my case), and at least some of its optimizations rely on that.
In the end I found a paper named "Fast Texture Synthesis using Tree-structured Vector Quantization" (Li-Yi Wei, Marc Levoy, Stanford University), which uses a technique based on an algorithm similar to the one I used. The image to be searched is downsampled several times (similar to mip-map generation), the search is performed on the lowest level first, and then on the next. It may not be the best way to do image search for other applications, but it's perfect for my purposes. The paper is relatively old, but the technique works for me.
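A minimal sketch of that coarse-to-fine idea with a single pyramid level, assuming even-sized 8-bit grayscale images: downsample with a 2x2 box filter, search exhaustively at half resolution, then refine in a small window around the doubled coarse hit. All names here are illustrative, not from the paper:

```cpp
#include <algorithm>
#include <climits>
#include <cstdlib>
#include <vector>

using Img = std::vector<unsigned char>; // row-major grayscale

// 2x2 box-filter downsample (w and h assumed even).
Img downsample2x(const Img& src, int w, int h)
{
    Img dst((w / 2) * (h / 2));
    for (int y = 0; y < h / 2; ++y)
        for (int x = 0; x < w / 2; ++x)
            dst[y * (w / 2) + x] =
                (src[2*y*w + 2*x]     + src[2*y*w + 2*x + 1] +
                 src[(2*y+1)*w + 2*x] + src[(2*y+1)*w + 2*x + 1]) / 4;
    return dst;
}

int sad(const Img& a, int aw, const Img& b, int bw, int bh, int px, int py)
{
    int s = 0;
    for (int y = 0; y < bh; ++y)
        for (int x = 0; x < bw; ++x)
            s += std::abs(b[y * bw + x] - a[(py + y) * aw + (px + x)]);
    return s;
}

// Exhaustive search over px in [x0,x1), py in [y0,y1).
void searchRange(const Img& a, int aw, const Img& b, int bw, int bh,
                 int x0, int x1, int y0, int y1, int& bestX, int& bestY)
{
    int minSad = INT_MAX;
    for (int py = y0; py < y1; ++py)
        for (int px = x0; px < x1; ++px) {
            int s = sad(a, aw, b, bw, bh, px, py);
            if (s < minSad) { minSad = s; bestX = px; bestY = py; }
        }
}

// Search the half-size images exhaustively, then refine at full resolution
// in a +/-2 pixel window around the doubled coarse position.
void pyramidSearch(const Img& a, int aw, int ah,
                   const Img& b, int bw, int bh, int& bestX, int& bestY)
{
    Img aSmall = downsample2x(a, aw, ah), bSmall = downsample2x(b, bw, bh);
    int cx = 0, cy = 0;
    searchRange(aSmall, aw / 2, bSmall, bw / 2, bh / 2,
                0, aw / 2 - bw / 2 + 1, 0, ah / 2 - bh / 2 + 1, cx, cy);
    int x0 = std::max(0, 2 * cx - 2), x1 = std::min(aw - bw + 1, 2 * cx + 3);
    int y0 = std::max(0, 2 * cy - 2), y1 = std::min(ah - bh + 1, 2 * cy + 3);
    searchRange(a, aw, b, bw, bh, x0, x1, y0, y1, bestX, bestY);
}
```

A real pyramid repeats the downsample step several times and refines level by level, so the exhaustive search only ever runs on a tiny image.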
The same paper mentions a few techniques for accelerating the search even further. One of them is tree-structured vector quantization (TSVQ), although I can't give more info about it (I haven't checked it - the current texture generator works at acceptable speed on my hardware, so I probably won't look into further optimizations).