I realize this is kind of a goofy question, for lack of a better term. I'm just kind of looking for any outside idea on increasing the efficiency of this code, as it's bogging down the system very badly (it has to perform this function a lot) and I'm running low on ideas.

What it's doing is loading two image containers (imgRGB for a full-color image and imgBW for a b&w image) pixel by individual pixel from an image that's stored in "unsigned char *pImage".

Both imgRGB and imgBW are containers for accessing individual pixels as necessary.

// input is in the form of an unsigned char
// unsigned char *pImage

for (int y=0; y < 640; y++) {
    for (int x=0; x < 480; x++) {
     imgRGB[y][x].blue = *pImage;
     pImage++;

     imgRGB[y][x].green = *pImage;
     imgBW[y][x]        = *pImage;
     pImage++;

     imgRGB[y][x].red = *pImage;
     pImage++;
    }
}

Like I said, I was just kind of looking for fresh input and ideas on better memory management and/or copy than this. Sometimes I look at my own code so much I get tunnel vision... a bit of a mental block. If anyone wants/needs more information, by all means let me know.

+3  A: 

You could optimize away some of the pointer arithmetic you're doing over and over with the subscript operators [][] and use an iterator instead (that is, advance a pointer).

Assaf Lavie
if those are primitive arrays, the compiler should do that for you... dunno if it's smart enough to do that on some custom type. most likely not.
rmeador
It may not be a net advantage. Continually modifying the same variable means latency becomes an issue. Adding various offsets to the same constant base address allows each index to be computed in parallel.
jalf
A: 

If possible, fix this at a higher level than bit or instruction twiddling!

  • You could specialize the B&W image class to one that references the green channel of the color image class (thus saving a copy per pixel). If you always create them in pairs, you might not even need the naive imgBW class at all.

  • By taking care about how you store the data in imgRGB, you could copy a triplet at a time from the input data. Better, you might copy the whole thing, or even just store a reference (which makes the previous suggestion easy as well).

If you don't control the implementation of everything here, you might be stuck, then:

  • Last resort: unroll the loop (cue someone mentioning Duff's device, or just ask the compiler to do it for you...), though I don't think you'll see much improvement...
dmckee
Just using the green channel seems like a bad idea. Consider an image with a black background and a red rectangle. The green channel would then be a completely black image which is probably undesirable.
kigurai
I just did what the OP asked for. Many real world imaging devices are more sensitive to the whole visual spectrum on the green channel than in other channels, so that is a reasonable approximation to a black and white imaging device. It's not "right" but that won't stop people from using it...
dmckee
I agree that it would probably work in _most_ cases, but since we have no clue as to the type of images the OP is working on I thought it was best to point this out.
kigurai
+4  A: 

I think the array accesses (are they real array accesses or operator []?) are going to kill you. Each one represents a multiply.

Basically, you want something like this:

for (int y=0; y < height; y++) {
    unsigned char *destBgr = imgRgb.GetScanline(y); // inline methods are better
    unsigned char *destBW = imgBW.GetScanline(y);
    for (int x=0; x < width; x++) {
        *destBgr++ = *pImage++;
        *destBW++ = *destBgr++ = *pImage++; // do this in one shot - don't double deref
        *destBgr++ = *pImage++;
    }
}

This will do two multiplies per scanline. Your code was doing 4 multiplies per PIXEL.

plinth
I think an optimizing compiler will generate the same code with OP's code
On x86, the entire array indexing can be done in a single instruction (with 1 cycle's latency, as I recall), and so it won't be a problem. On other platforms, this may be an issue, but I doubt it. It's the memory accesses that are heavy in this code, not the arithmetic.
jalf
Besides, on most modern CPU's, mul and add are equally fast.
jalf
Only for floating point numbers, jalf. integer-multiply's latency is usually 6-10 times that of integer add ( http://www.swox.com/doc/x86-timing.pdf ). On most PowerPCs it's worse since imul is microcoded and that brings everything else in the pipeline to a halt until it's finished.
Crashworks
It's not quite that bad. If you check Intel's docs, a modern x86 has a latency of 3-4 cycles, and of course fully pipelined. Further, many of these multiplications can be replaced with bit shifting at compiletime. But you're right, latency is higher. My bad there. :)
jalf
Any half-decent optimizing compiler will strength-reduce the multiplies into additions (see http://en.wikipedia.org/wiki/Induction_variable_analysis)
Adam Rosenfield
Good - you're all touching on one important point that's often missed: you need to measure with your compiler and your runtime environment.
plinth
A: 

It seems that you defined each pixel as some kind of structure or object. Using a primitive type (say, int) could be faster. As others have mentioned, the compiler is likely to optimize the array access using pointer increments. If the compiler doesn't do that for you, you can do it yourself to avoid multiplications when you use array[][].

Since you only need 3 bytes per pixel, you could pack one pixel into one int. By doing that, you could copy 3 bytes a time instead of byte-by-byte. The only tricky thing is when you want to read individual color components of a pixel, you will need some bit masking and shifting. This could give you more overhead than that saved by using an int.

Or you can use 3 int arrays for 3 color components respectively. You will need a lot more storage, though.
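
To make the masking and shifting cost concrete, here is a minimal sketch of the packed-int idea. The helper names (packBGR, blueOf, greenOf, redOf) and the bit layout are assumptions made for illustration; they do not come from the question.

```cpp
#include <cstdint>

// Hypothetical helpers: pack one BGR pixel into the low 24 bits of a 32-bit int.
// Assumed layout: bits 0-7 blue, bits 8-15 green, bits 16-23 red, bits 24-31 unused.
inline std::uint32_t packBGR(unsigned char blue, unsigned char green, unsigned char red)
{
    return static_cast<std::uint32_t>(blue)
         | (static_cast<std::uint32_t>(green) << 8)
         | (static_cast<std::uint32_t>(red) << 16);
}

// Reading a component back costs one shift and one mask.
inline unsigned char blueOf(std::uint32_t p)  { return static_cast<unsigned char>(p & 0xff); }
inline unsigned char greenOf(std::uint32_t p) { return static_cast<unsigned char>((p >> 8) & 0xff); }
inline unsigned char redOf(std::uint32_t p)   { return static_cast<unsigned char>((p >> 16) & 0xff); }
```

Whether the single 32-bit store outweighs the extra shift and mask on every component read is something only a measurement on the target machine can answer.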

PolyThinker
+4  A: 

What I like to do in situations like this is go into the debugger and step through the disassembly to see what it is really doing (or have the compiler generate an assembly listing). This can give you a lot of clues about where inefficiencies are. They are often not where you think!

By implementing the changes suggested by Assaf and David Lee above, you can get a before and after instruction count. This really helps me in optimizing tight inner loops.

PhysicalEd
Yep, it's hard to optimize without looking at the generated assembly. +1 for that. Simply going by the instruction count is probably not very reliable though. When you take into account data dependencies, pipelining and instruction reordering, more instructions may lead to faster execution.
jalf
Absolutely right - but it's a start. At least you can get an idea of complexity by looking at the compiler output of a few variations. And catch things like overlooked object construction, multiplications etc. - oops! More than that requires the system manuals and pretty deep knowledge IMO.
PhysicalEd
true. But still worth mentioning so people don't blindly trust the instruction count to say everything. :)
jalf
+1  A: 

I'm assuming the following at the moment, so please let me know if my assumptions are wrong:

a) imgRGB is a structure of the type


    struct ImgRGB
    {
      unsigned char blue;
      unsigned char green;
      unsigned char red;
    };

or at least something similar.

b) imgBW looks something like this:


    struct ImgBW
    {
       unsigned char BW;
    };

c) The code is single threaded

Assuming the above, I see several problems with your code:

  • You put the assignment to the BW part right in the middle of the assignments to the other containers. If you're working on a modern CPU, chances are that with the size of your data your L1 cache gets invalidated every time you're switching containers and you're looking at reloading or switching a cache line. Caches are optimised for linear access these days so hopping to and fro doesn't help. Accessing main memory is a lot slower, so that would be a noticeable performance hit. To verify if this is a problem, temporarily I'd remove the assignment to imgBW and measure if there is a noticeable speedup.
  • The array access doesn't help and it'll potentially slow down the code a little, although a decent optimiser should take care of that. I'd probably write the loop along these lines instead, but would not expect a big performance gain. Maybe a couple percent.

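    // rgbPtr is a placeholder name for a pointer that walks imgRGB
    for (int y=0; y < 640; y++) {
        for (int x=0; x < 480; x++) {
            rgbPtr->blue = *pImage;
            ...
        }
    }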
  • For consistency I would change from using postfix to prefix increment but I would not expect to see a big gain.
  • If you can waste a little storage (well, 25%) you might gain from adding a fourth dummy unsigned char to the structure ImgRGB provided that this would increase the size of the structure to the size of an int. Native ints are usually fastest to access and if you're looking at a structure of chars that are not filling up an int completely, you're potentially running into all sorts of interesting access issues that can slow your code down noticeably because the compiler might have to generate additional instructions to extract the unsigned chars. Again, try this and measure the result - it might make a noticeable difference or none at all. In the same vein, upping the size of the structure members from unsigned char to unsigned int might waste lots of space but potentially can speed up the code. Nevertheless as long as pImage is a pointer to an unsigned char, you would only eliminate half the problem.

All in all you are down to making your loop fit to your underlying hardware, so for specific optimisation techniques you might have to read up on what your hardware does well and what it does badly.
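
The padding suggestion can be sketched as follows. ImgRGBPadded and loadPixel are hypothetical names, and the struct is only a guess at how the real ImgRGB might be extended; the static_assert makes the build fail if the compiler does not produce a 4-byte layout.

```cpp
#include <cstdint>

// Hypothetical padded variant of the ImgRGB struct assumed above.
struct ImgRGBPadded
{
    unsigned char blue;
    unsigned char green;
    unsigned char red;
    unsigned char unused;   // dummy byte so that one pixel fills 32 bits
};

// Fails to compile if the compiler lays the struct out differently.
static_assert(sizeof(ImgRGBPadded) == sizeof(std::uint32_t),
              "ImgRGBPadded should be exactly 4 bytes");

// Copies one pixel from the 3-byte-per-pixel input into the padded struct
// and returns a pointer to the next input pixel.
inline const unsigned char *loadPixel(const unsigned char *src, ImgRGBPadded &dst)
{
    dst.blue = src[0];
    dst.green = src[1];
    dst.red = src[2];
    dst.unused = 0;
    return src + 3;
}
```

The output array grows by a third, so this trades memory traffic for simpler stores; try it and measure.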

Timo Geusch
+2  A: 

You might try using a simple cast to get your RGB data, and just recompute the grayscale data:

#pragma pack(push, 1)
typedef unsigned char bw_t;
typedef struct {
    unsigned char blue;
    unsigned char green;
    unsigned char red;
} rgb_t;
#pragma pack(pop)

rgb_t *imageRGB = (rgb_t*)pImage;
bw_t *imageBW = (bw_t*)calloc(640*480, sizeof(bw_t));
// RGB(X,Y) = imageRGB[Y*480 + X]
// BW(X,Y) = imageBW[Y*480 + X]

for (int y = 0; y < 640; ++y)
{
   // process 8 pixels per iteration, i.e. 24 bytes of pImage (24 is arbitrary)
   // 24 / sizeof(rgb_t) = 8
   for (int x = 0; x < 480; x += 8)
   {
       imageBW[y*480 + x    ] = GRAYSCALE(imageRGB[y*480 + x    ]);
       imageBW[y*480 + x + 1] = GRAYSCALE(imageRGB[y*480 + x + 1]);
       imageBW[y*480 + x + 2] = GRAYSCALE(imageRGB[y*480 + x + 2]);
       imageBW[y*480 + x + 3] = GRAYSCALE(imageRGB[y*480 + x + 3]);
       imageBW[y*480 + x + 4] = GRAYSCALE(imageRGB[y*480 + x + 4]);
       imageBW[y*480 + x + 5] = GRAYSCALE(imageRGB[y*480 + x + 5]);
       imageBW[y*480 + x + 6] = GRAYSCALE(imageRGB[y*480 + x + 6]);
       imageBW[y*480 + x + 7] = GRAYSCALE(imageRGB[y*480 + x + 7]);
   }
}
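
GRAYSCALE is not defined in the snippet above. One possible definition, offered only as an assumption, is an integer approximation of the commonly used luma weights 0.299 R + 0.587 G + 0.114 B. The rgb_t struct is repeated here so that the sketch stands alone.

```cpp
// Mirrors the rgb_t struct from the snippet above.
struct rgb_t { unsigned char blue, green, red; };

// Assumed definition: the weights 77, 150 and 29 approximate 0.299, 0.587 and 0.114
// scaled by 256. They sum to 256, so the shifted result always fits in one byte.
inline unsigned char grayscale(const rgb_t &p)
{
    return static_cast<unsigned char>((77u * p.red + 150u * p.green + 29u * p.blue) >> 8);
}

#define GRAYSCALE(p) grayscale(p)
```

If the green channel alone is good enough, as in the question, the function body can simply return p.green.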
sixlettervariables
+3  A: 

Memory bandwidth is your bottleneck here. There is a theoretical minimum time required to transfer all the data to and from system memory. I wrote a little test to compare the OP's version with some simple assembler to see how good the compiler was. I'm using VS2005 with default release mode settings. Here's the code:

#include <windows.h>
#include <iostream>
using namespace std;

const int
c_width = 640,
c_height = 480;

typedef struct _RGBData
{
  unsigned char
    r,
    g,
    b;
    // I'm assuming there's no padding byte here
} RGBData;

//  similar to the code given
void SimpleTest
(
  unsigned char *src,
  RGBData *rgb,
  unsigned char *bw
)
{
  for (int y = 0 ; y < c_height ; ++y)
  {
    for (int x = 0 ; x < c_width ; ++x)
    {
      rgb [x + y * c_width].b = *src;
      src++;

      rgb [x + y * c_width].g = *src;
      bw [x + y * c_width] = *src;
      src++;

      rgb [x + y * c_width].r = *src;
      src++;
    }
  }
}

//  the assembler version
void ASM
(
  unsigned char *src,
  RGBData *rgb,
  unsigned char *bw
)
{
  const int
    count = 3 * c_width * c_height / 12;

  _asm
  {
    push ebp
    mov esi,src
    mov edi,bw
    mov ecx,count
    mov ebp,rgb
l1:
    mov eax,[esi]
    mov ebx,[esi+4]
    mov edx,[esi+8]
    mov [ebp],eax
    shl eax,16
    mov [ebp+4],ebx
    rol ebx,16
    mov [ebp+8],edx
    shr edx,24
    and eax,0xff000000
    and ebx,0x00ffff00
    and edx,0x000000ff
    or eax,ebx
    or eax,edx
    add esi,12
    bswap eax
    add ebp,12
    stosd
    loop l1
    pop ebp
  }
}

//  timing framework
LONGLONG TimeFunction
(
  void (*function) (unsigned char *src, RGBData *rgb, unsigned char *bw),
  const char *description,
  unsigned char *src, 
  RGBData *rgb,
  unsigned char *bw
)
{
  LARGE_INTEGER
    start,
    end;

  cout << "Testing '" << description << "'...";
  memset (rgb, 0, sizeof *rgb * c_width * c_height);
  memset (bw, 0, c_width * c_height);

  QueryPerformanceCounter (&start);

  function (src, rgb, bw);

  QueryPerformanceCounter (&end);

  bool
    ok = true;

  unsigned char
    *bw_check = bw,
    i = 0;

  RGBData
    *rgb_check = rgb;

  for (int count = 0 ; count < c_width * c_height ; ++count)
  {
    if (bw_check [count] != i || rgb_check [count].r != i || rgb_check [count].g != i || rgb_check [count].b != i)
    {
      ok = false;
      break;
    }

    ++i;
  }

  cout << (end.QuadPart - start.QuadPart) << (ok ? " OK" : " Failed") << endl;
  return end.QuadPart - start.QuadPart;
}

int main
(
  int argc,
  char *argv []
)
{
  unsigned char
    *source_data = new unsigned char [c_width * c_height * 3];

  RGBData
    *rgb = new RGBData [c_width * c_height];

  unsigned char
    *bw = new unsigned char [c_width * c_height];

  int
    v = 0;

  for (unsigned char *dest = source_data ; dest < &source_data [c_width * c_height * 3] ; ++dest)
  {
    *dest = v++ / 3;
  }

  LONGLONG
    totals [2] = {0, 0};

  for (int i = 0 ; i < 10 ; ++i)
  {
    cout << "Iteration: " << i << endl;
    totals [0] += TimeFunction (SimpleTest, "Initial Copy", source_data, rgb, bw);
    totals [1] += TimeFunction (       ASM, "    ASM Copy", source_data, rgb, bw);
  }

  LARGE_INTEGER
    freq;

  QueryPerformanceFrequency (&freq);

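  // Ticks per 10 microseconds. The totals cover 10 iterations, so each quotient
  // below is the average time per iteration in microseconds.
  freq.QuadPart /= 100000;

  cout << totals [0] / freq.QuadPart << "us" << endl;
  cout << totals [1] / freq.QuadPart << "us" << endl;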


  delete [] bw;
  delete [] rgb;
  delete [] source_data;

  return 0;
}

And the ratio between C and assembler I was getting was about 2.5:1, i.e. C was 2.5 times the time of the assembler version.

I've just noticed the original data was in BGR order. If the copy swapped the B and R components then it does make the assembler code a bit more complex. But it would also make the C code more complex too.

Ideally, you need to work out what the theoretical minimum time is and compare it to what you're actually getting. To do that, you need to know the memory frequency and the type of memory and the workings of the CPU's MMU.
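
As a rough worked example of that lower bound, the sketch below counts the bytes that have to move for one 640x480 frame and divides by a sustained bandwidth. The bandwidth you pass in is an assumption you have to supply for your own machine, and the count ignores extra traffic such as cache lines being fetched before they are written.

```cpp
#include <cstddef>

// Bytes that must cross the memory bus for one 640x480 frame.
constexpr std::size_t kPixels = 640 * 480;
constexpr std::size_t kBytesRead = kPixels * 3;              // source BGR data
constexpr std::size_t kBytesWritten = kPixels * 3 + kPixels; // RGB copy + BW copy
constexpr std::size_t kBytesMoved = kBytesRead + kBytesWritten;

// Lower bound in microseconds for a given sustained bandwidth in bytes per second.
constexpr double minMicroseconds(double bytesPerSecond)
{
    return kBytesMoved / bytesPerSecond * 1e6;
}
```

That is 2,150,400 bytes per frame, so at a hypothetical 1 GB/s the copy cannot finish in much less than about 2.15 ms, whatever the inner loop looks like.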

Skizz

+7  A: 

The obvious question is, do you need to copy the data in the first place? Can't you just define accessor functions to extract the R, G and B values for any given pixel from the original input array?

If the image data is transient so you have to keep a copy of it, you could just make a raw copy of it without any reformatting, and again define accessors to index into each pixel/channel on that.
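
A minimal sketch of such accessors, assuming the input is tightly packed BGR with 3 bytes per pixel in row-major order. BgrView and its member names are made up for illustration.

```cpp
#include <cstddef>

// Hypothetical read-only view over the raw BGR buffer; nothing is copied.
class BgrView
{
public:
    // width is the number of pixels per row (480 with the indexing used in the question).
    BgrView(const unsigned char *data, std::size_t width)
        : data_(data), width_(width) {}

    unsigned char blue(std::size_t y, std::size_t x) const  { return pixel(y, x)[0]; }
    unsigned char green(std::size_t y, std::size_t x) const { return pixel(y, x)[1]; }
    unsigned char red(std::size_t y, std::size_t x) const   { return pixel(y, x)[2]; }

    // The question uses the green channel as the black-and-white value.
    unsigned char bw(std::size_t y, std::size_t x) const { return green(y, x); }

private:
    const unsigned char *pixel(std::size_t y, std::size_t x) const
    {
        return data_ + (y * width_ + x) * 3;   // 3 bytes per pixel, row-major
    }

    const unsigned char *data_;
    std::size_t width_;
};
```

With this, BgrView view(pImage, 480); view.green(y, x) replaces imgRGB[y][x].green, and the per-frame copy disappears entirely as long as pImage stays valid.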

Assuming the copy you outlined is necessary, unrolling the loop a few times may prove to help.

I think the best approach will be to unroll the loop enough times to ensure that each iteration processes a chunk of data divisible by 4 bytes (so in each iteration, the loop can simply read a small number of ints, rather than a large number of chars). Of course this requires you to mask out bits of these ints when writing, but that's a fast operation, and most importantly, it is done in registers, without burdening the memory subsystem or the CPU cache:

// First, we need to treat the input image as an array of ints. This is a bit nasty and technically unportable, but you get the idea.
unsigned int* img = reinterpret_cast<unsigned int*>(pImage);

for (int y = 0; y < 640; ++y)
{
  for (int x = 0; x < 480; x += 4)
  {
    // At the start of each iteration, read 3 ints. That's 12 bytes, enough to write exactly 4 pixels.
    unsigned int i0 = *img;
    unsigned int i1 = *(img+1);
    unsigned int i2 = *(img+2);
    img += 3;

    // This probably won't make a difference, but keeping a reference to the found pixel saves some typing, and it may assist the compiler in avoiding aliasing.
    ImgRGB& pix0 = imgRGB[y][x];
    pix0.blue = i0 & 0xff;
    pix0.green = (i0 >> 8) & 0xff;
    pix0.red = (i0 >> 16) & 0xff;
    imgBW[y][x] = (i0 >> 8) & 0xff;

    ImgRGB& pix1 = imgRGB[y][x+1];
    pix1.blue = (i0 >> 24) & 0xff;
    pix1.green = i1 & 0xff;
    pix1.red = (i1 >> 8) & 0xff;
    imgBW[y][x+1] = i1 & 0xff;

    ImgRGB& pix2 = imgRGB[y][x+2];
    pix2.blue = (i1 >> 16) & 0xff;
    pix2.green = (i1 >> 24) & 0xff;
    pix2.red = i2 & 0xff;
    imgBW[y][x+2] = (i1 >> 24) & 0xff;

    ImgRGB& pix3 = imgRGB[y][x+3];
    pix3.blue = (i2 >> 8) & 0xff;
    pix3.green = (i2 >> 16) & 0xff;
    pix3.red = (i2 >> 24) & 0xff;
    imgBW[y][x+3] = (i2 >> 16) & 0xff;
  }
}

It is also very likely that you're better off filling a temporary ImgRGB value, and then writing that entire struct to memory at once, meaning that the first block would look like this instead (the following blocks would be similar, of course):

ImgRGB& pix0 = imgRGB[y][x];
ImgRGB tmpPix0;
tmpPix0.blue = i0 & 0xff;
tmpPix0.green = (i0 >> 8) & 0xff;
tmpPix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
pix0 = tmpPix0;

Depending on how clever the compiler is, this may cut down dramatically on the required number of reads. Assuming the original code is naively compiled (which is probably unlikely, but will serve as an example), this will get you from 3 reads and 4 writes per pixel (read RGB channel, and write RGB + BW) to 3/4 reads per pixel and 2 writes. (one write for the RGB struct, and one for the BW value)

You could also accumulate the 4 writes to the BW image in a single int, and then write that in one go too, something like this:

unsigned int bw = (i0 >> 8) & 0xff;
bw |=  (i1 & 0xff) << 8;
bw |=  ((i1 >> 24) & 0xff) << 16;
bw |=  ((i2 >> 16) & 0xff) << 24;

*(imgBW + (y*480 + x)/4) = bw; // Assuming you can treat imgBW as an array of integers

This would cut down on the number of writes to 1.25 per pixel (1 per RGB struct, and 1 for every 4 BW values)

Again, the benefit will probably be a lot smaller (or even nonexistent), but it may be worth a shot.

Taking this a step further, the same could be done without too much trouble using the SSE instructions, allowing you to process 4 times as many values per iteration. (Assuming you're running on x86)

Of course, an important disclaimer here is that the above is nonportable. The reinterpret_cast is probably an academic point (it'll most likely work no matter what, especially if you can ensure that the original array is aligned on a 32-bit boundary, which will typically be the case for large allocations on all platforms). A bigger issue is that the bit-twiddling depends on the CPU's endianness.

But in practice, this should work on x86, and with small changes, it should work on big-endian machines too (modulo any bugs in my code, of course. I haven't tested or even compiled any of it ;)).
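
One way to sidestep both the cast and the endianness dependence is to assemble each 32-bit word from bytes explicitly. The sketch below is a variant of the unrolled loop body above, not the code from this answer: loadLE32, byteAt and convert4 are hypothetical names, and ImgRGB is assumed to be a plain 3-byte struct. Many compilers turn the byte-assembly pattern into a single load on little-endian targets, but check the generated assembly before relying on that.

```cpp
#include <cstdint>

// Assumed pixel layout, matching the blue/green/red order used in the question.
struct ImgRGB { unsigned char blue, green, red; };

// Assembles a 32-bit value from 4 bytes, first byte in the low bits,
// regardless of the endianness of the machine.
inline std::uint32_t loadLE32(const unsigned char *p)
{
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Extracts byte n (0 = lowest) from a 32-bit word.
inline unsigned char byteAt(std::uint32_t word, int n)
{
    return static_cast<unsigned char>((word >> (8 * n)) & 0xff);
}

// Converts 4 pixels (12 input bytes) into rgb[0..3] and bw[0..3].
inline void convert4(const unsigned char *src, ImgRGB *rgb, unsigned char *bw)
{
    const std::uint32_t i0 = loadLE32(src);      // B0 G0 R0 B1
    const std::uint32_t i1 = loadLE32(src + 4);  // G1 R1 B2 G2
    const std::uint32_t i2 = loadLE32(src + 8);  // R2 B3 G3 R3

    rgb[0] = ImgRGB{byteAt(i0, 0), byteAt(i0, 1), byteAt(i0, 2)};
    rgb[1] = ImgRGB{byteAt(i0, 3), byteAt(i1, 0), byteAt(i1, 1)};
    rgb[2] = ImgRGB{byteAt(i1, 2), byteAt(i1, 3), byteAt(i2, 0)};
    rgb[3] = ImgRGB{byteAt(i2, 1), byteAt(i2, 2), byteAt(i2, 3)};

    // The black-and-white value is the green channel of each pixel.
    bw[0] = byteAt(i0, 1);
    bw[1] = byteAt(i1, 0);
    bw[2] = byteAt(i1, 3);
    bw[3] = byteAt(i2, 2);
}
```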

But no matter how you solve it, you're going to see the biggest speed improvements from minimizing the number of reads and writes, and trying to accumulate as much data in the CPU's registers as possible. Read all you can in large chunks, like ints, reorder it in the registers (accumulate it into a number of ints, or write it into temporary instances of the RGB struct), and then write those combined values out to memory.

Depending on how much you know about low-level optimizations, it may be surprising to you, but temporary variables are fine, while direct memory to memory access can be slow (for example your pointer dereferencing assigned directly into the array). The problem with this is that you may get more memory accesses than necessary, and it's harder for the compiler to guarantee that no aliasing will occur, and so it may be unable to reorder or combine the memory accesses. You're generally better off reading as much as you can early on (top of the loop), doing as much as possible in temporaries (because the compiler can keep everything in registers), and then writing everything out at the end. That also gives the compiler as much leeway as possible to wait for the initially slow reads.

Finally, adding a 4th dummy value to the RGB struct (so it has a total size of 32bit) will most likely help a lot too (because then writing such a struct is a single 32-bit write, which is simpler and more efficient than the current 24-bit)

When deciding how much to unroll the loop (you could do the above twice or more in each iteration), keep in mind how many registers your CPU has. Spilling out into the cache will probably hurt you as there are plenty of memory accesses already, but on the other hand, unroll as much as you can afford given the number of registers available. The above uses 3 registers for keeping the input data, and one to accumulate the BW values. It may need one or two more to compute the necessary addresses, so on x86, doubling the above might be pushing it a bit (you have 8 registers total, and some of them have special meanings). On the other hand, modern CPUs do a lot to compensate for register pressure, by using a much larger number of registers behind the scenes, so further unrolling might still be a total performance win.

As always, measure measure measure. It's impossible to say what's fast and what isn't until you've tested it.
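
A minimal timing sketch for that, using std::chrono from the standard library. bestMicroseconds is a hypothetical helper; it reports the fastest of several runs, which is one common way to reduce noise from other processes but not the only reasonable choice.

```cpp
#include <chrono>

// Runs fn() `runs` times and returns the best wall-clock time in microseconds.
template <typename Fn>
double bestMicroseconds(Fn fn, int runs)
{
    double best = -1.0;
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const double us =
            std::chrono::duration<double, std::micro>(stop - start).count();
        if (best < 0.0 || us < best)
            best = us;   // keep the fastest run seen so far
    }
    return best;
}
```

Wrap each variant of the copy loop in a lambda and compare the results, making sure the compiler cannot discard the work (for example by reading the output buffers afterwards).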

Another general point to keep in mind is that data dependencies are bad. This won't be a big deal as long as you're only dealing with integral values, but it still inhibits instruction reordering and superscalar execution. In the above, I've tried to keep dependency chains as short as possible. Rather than continually incrementing the same pointer (which means that each increment is dependent on the previous one), adding a different offset to the same base address means that every address can be computed independently, again giving more freedom to the compiler to reorder and reschedule instructions.

jalf
WRT unrolling the loop - I've had Visual C++ 8's optimizer UNDO Duff's device on me. That was quite a shock.
plinth
hah, impressive. ;) Of course, Duff's device may not necessarily be efficient on today's CPUs, which must be why it did that. Still, impressive that it's able to recognize this and rewrite it.
jalf
+1  A: 

Make sure pImage, imgRGB, and imgBW are marked __restrict. Use SSE and do it sixteen bytes at a time.

Actually from what you're doing there it looks like you could use a simple memcpy() to copy pImage into imgRGB (since imgRGB is in row-major format and apparently in the same order as pImage). You could fill out imgBW by using a series of SSE swizzle and store ops to pack down the green values but it might be cumbersome since you'd need to work on ( 3*16 =) 48 bytes at a time.
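
The memcpy() part can be sketched without any SSE, assuming ImgRGB is a 3-byte struct in the same blue/green/red order as the input and that the output is one contiguous array. copyFrame is a hypothetical name, and the green extraction is a plain scalar loop here, not the SSE swizzle described above.

```cpp
#include <cstddef>
#include <cstring>

// Assumed pixel layout; the bulk copy only works if there is no padding.
struct ImgRGB { unsigned char blue, green, red; };
static_assert(sizeof(ImgRGB) == 3, "ImgRGB must have no padding for the bulk copy");

// Copies `pixels` BGR pixels into rgb in one go, then pulls out the green channel.
inline void copyFrame(const unsigned char *src, ImgRGB *rgb, unsigned char *bw,
                      std::size_t pixels)
{
    std::memcpy(rgb, src, pixels * sizeof(ImgRGB));
    for (std::size_t i = 0; i < pixels; ++i)
        bw[i] = src[3 * i + 1];   // green is the second byte of each pixel
}
```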

Are you sure pImage and your output arrays are all in dcache when you start this? Try using a prefetch hint to fetch 128 bytes ahead and measure to see if that improves things.

Edit If you're not on x86, replace "SSE" with the appropriate SIMD instruction set for your hardware, of course. (That'd be VMX, Altivec, SPU, VLIW, HLSL, etc.)

Crashworks
And SSE is available on an embedded DSP architecture where?
Adam Hawes
Did he mention that he was on an embedded DSP somewhere?
Crashworks
A: 

Here is one very tiny, very simple optimization:

You are referring to imgRGB[y][x] repeatedly, and that likely needs to be re-calculated at each step.

Instead, calculate it once, and see if that makes some improvement:

Pixel* apixel;

for (int y=0; y < 640; y++) {
    for (int x=0; x < 480; x++) {
        apixel = &imgRGB[y][x];

        apixel->blue = *pImage;
        pImage++;

        apixel->green = *pImage;
        imgBW[y][x]   = *pImage;
        pImage++;

        apixel->red = *pImage;
        pImage++;
    }
}
abelenky
Your compiler is smarter than that; it'd only calculate the expression once even in the original code.
Adam Hawes
A: 

If pImage is already entirely in memory, why do you need to massage the data? I mean if it is already in pseudo-RGB format, why can't you just write some inline routines/macros that can spit out the values on demand instead of copying it around?

If rearranging the pixel data is important for later operations, consider block operations and/or cache line optimization.

HUAGHAGUAH
+2  A: 

Several steps you can take. Result at the end of this answer.

First, use pointers.

const unsigned char *pImage;

RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;

for (int y=0; y < 640; ++y) {
    for (int x=0; x < 480; ++x) {
        rgbOut->blue = *pImage;
        ++pImage;

        unsigned char tmp = *pImage;  // Save to reduce amount of reads.
        rgbOut->green = tmp;
        *bwOut = tmp;
        ++pImage;

        rgbOut->red = *pImage;
        ++pImage;

        ++rgbOut;
        ++bwOut;
    }
}


If imgRGB and imgBW are declared as:

unsigned char imgBW[480][640];
RGB imgRGB[480][640];

You can combine the two loops:

const unsigned char *pImage;

RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;

for (int i=0; i < 640 * 480; ++i) {
    rgbOut->blue = *pImage;
    ++pImage;

    unsigned char tmp = *pImage;  // Save to reduce amount of reads.
    rgbOut->green = tmp;
    *bwOut = tmp;
    ++pImage;

    rgbOut->red = *pImage;
    ++pImage;

    ++rgbOut;
    ++bwOut;
}


You can exploit the fact that word reads are faster than four char reads. We will use a helper macro for this. Note this example assumes a little-endian target system.

const unsigned char *pImage;

RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;

const uint32_t *curPixelGroup = (const uint32_t *)pImage;

// Each iteration reads three 32-bit words (12 bytes) and writes four pixels.
for (int i=0; i < 640 * 480 / 4; ++i) {
    uint64_t pixels = 0;

#define WRITE_PIXEL         \
    rgbOut->blue = pixels;  \
    pixels >>= 8;           \
                            \
    rgbOut->green = pixels; \
    *bwOut = pixels;        \
    pixels >>= 8;           \
                            \
    rgbOut->red = pixels;   \
    pixels >>= 8;           \
                            \
    ++rgbOut;               \
    ++bwOut;

// Widen to 64 bits before shifting so the leftover bytes are not lost.
#define READ_PIXEL(shift) \
    pixels |= (uint64_t)(*curPixelGroup++) << ((shift) * 8);

    READ_PIXEL(0);  WRITE_PIXEL;
    READ_PIXEL(1);  WRITE_PIXEL;
    READ_PIXEL(2);  WRITE_PIXEL;
    /* Remaining */ WRITE_PIXEL;

#undef READ_PIXEL
#undef WRITE_PIXEL
}

(Your compiler will probably optimize away the redundant or operation in the first READ_PIXEL. It will also optimize shifts, removing the redundant << 0, too.)


If the structure of RGB is thus:

struct RGB {
     unsigned char blue, green, red;
};

You can optimize even further by copying to the struct directly, instead of through its members (red, green, blue). This can be done using anonymous structs (or casting, but that makes the code a bit more messy and probably more prone to error). (Again, this is dependent on little-endian systems, etc. etc.):

union RGB {
    struct {
        unsigned char blue, green, red;
    };

    uint32_t rgb:24;  // Make sure it's a bitfield, otherwise the union will stretch and ruin the ++ operator.
};

const unsigned char *pImage;

RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;

const uint32_t *curPixelGroup = (const uint32_t *)pImage;

// Each iteration reads three 32-bit words (12 bytes) and writes four pixels.
for (int i=0; i < 640 * 480 / 4; ++i) {
    uint64_t pixels = 0;

#define WRITE_PIXEL         \
    rgbOut->rgb = pixels;   \
    pixels >>= 8;           \
                            \
    *bwOut = pixels;        \
    pixels >>= 16;          \
                            \
    ++rgbOut;               \
    ++bwOut;

// Widen to 64 bits before shifting so the leftover bytes are not lost.
#define READ_PIXEL(shift) \
    pixels |= (uint64_t)(*curPixelGroup++) << ((shift) * 8);

    READ_PIXEL(0);  WRITE_PIXEL;
    READ_PIXEL(1);  WRITE_PIXEL;
    READ_PIXEL(2);  WRITE_PIXEL;
    /* Remaining */ WRITE_PIXEL;

#undef READ_PIXEL
#undef WRITE_PIXEL
}

You can optimize writing the pixels in the same way as we did with reading (writing in words rather than in 24-bit units). In fact, that'd be a pretty good idea, and will be a great next step in optimization. Too tired to code it, though. =]


Of course, you can write the routine in assembly language. This makes it less portable than it already is, however.

strager