I hate to ruin anyone's day, but if you don't want to go the IPP route (see photo_tom's answer) or pull in an optimized library, you might get better performance from the following (a modification of Andreas's answer):
uchar *iplImagePtr = (uchar *) iplImage->imageData;
// Assumes tightly packed 3-channel pixels (no row padding).
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
    // Swap the first and third channel of pixel y.
    std::swap(iplImagePtr[y * 3], iplImagePtr[y * 3 + 2]);
}
Now hold on, folks, I hear you yelling "but all those extra multiplies and adds!" The thing is, this form of the loop is far easier for a compiler to optimize, especially if compilers get smart enough to multithread this sort of algorithm, because each pass through the loop is independent of those before or after. In the other form, the value of iplImagePtr was dependent on its value in the previous pass. In this form, it is constant throughout the whole loop; only y changes, and that is in a very, very common "count from 0 to N-1" loop construct, so it's easier for an optimizer to digest.
Or maybe it doesn't make a difference these days because optimizers are insanely smart (are they?). I wonder what a benchmark would say...
P.S. If you actually benchmark this, I'd also like to see how well the following performs:
uchar *iplImagePtr = (uchar *) iplImage->imageData;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
    // pixel is recomputed from y on every pass, so no value carries over.
    uchar *pixel = iplImagePtr + y * 3;
    std::swap(pixel[0], pixel[2]);
}
Again, pixel is defined inside the loop to limit its scope and keep the optimizer from thinking there's a cycle-to-cycle dependency. If the compiler increments and decrements the stack pointer each time through the loop to "create" and "destroy" pixel, well, it's stupid and I'll apologize for wasting your time.
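For anyone who does want to benchmark this, here is a rough sketch of how the three loop forms could be compared. It is only a starting point, not a rigorous benchmark. Everything in it is made up for illustration: the function names (swapPointerBump, swapIndexed, swapLocalPointer, timeMicros, makeTestImage) are hypothetical, a std::vector stands in for iplImage->imageData, and it assumes tightly packed 3-byte pixels with no row padding. The first function is my guess at the general shape of the pointer-incrementing form, not a copy of anyone's code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

using uchar = unsigned char;

// Form A (assumed shape of the "other form"): the pointer itself is
// advanced each pass, so every iteration depends on the previous one.
void swapPointerBump(std::vector<uchar>& data) {
    assert(data.size() % 3 == 0);  // whole 3-byte pixels only
    uchar* p = data.data();
    uchar* end = p + data.size();
    for (; p != end; p += 3) {
        std::swap(p[0], p[2]);
    }
}

// Form B: the base pointer stays constant; only the index y changes.
void swapIndexed(std::vector<uchar>& data) {
    assert(data.size() % 3 == 0);
    uchar* base = data.data();
    const std::size_t limit = data.size() / 3;
    for (std::size_t y = 0; y < limit; ++y) {
        std::swap(base[y * 3], base[y * 3 + 2]);
    }
}

// Form C: a loop-local pixel pointer recomputed from y on every pass.
void swapLocalPointer(std::vector<uchar>& data) {
    assert(data.size() % 3 == 0);
    uchar* base = data.data();
    const std::size_t limit = data.size() / 3;
    for (std::size_t y = 0; y < limit; ++y) {
        uchar* pixel = base + y * 3;
        std::swap(pixel[0], pixel[2]);
    }
}

// Times a single call of f on data and returns elapsed microseconds.
template <typename F>
long long timeMicros(F f, std::vector<uchar>& data) {
    auto start = std::chrono::steady_clock::now();
    f(data);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Builds a fake image buffer with a simple repeating byte pattern.
std::vector<uchar> makeTestImage(std::size_t pixels) {
    std::vector<uchar> data(pixels * 3);
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = static_cast<uchar>(i % 251);
    }
    return data;
}
```

To get numbers worth trusting, you would want to call timeMicros many times per form on a realistically sized buffer, build with optimizations on, and compare medians rather than single runs. Note that a real IplImage may pad its rows (widthStep can be larger than width * 3), in which case a flat loop like these would need a per-row outer loop.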