The obvious question is, do you need to copy the data in the first place? Can't you just define accessor functions to extract the R, G and B values for any given pixel from the original input array?
If the image data is transient so you have to keep a copy of it, you could just make a raw copy of it without any reformatting, and again define accessors to index into each pixel/channel on that.
Assuming the copy you outlined is necessary, unrolling the loop a few times may prove to help.
I think the best approach will be to unroll the loop enough times to ensure that each iteration processes a chunk of data divisible by 4 bytes (so in each iteration, the loop can simply read a small number of ints, rather than a large number of chars)
Of course this requires you to mask out bits of these ints when writing, but that's a fast operation, and most importantly, it is done in registers, without burdening the memory subsystem or the CPU cache:
// First, we need to treat the input image as an array of ints. This is a bit nasty and technically unportable, but you get the idea)
unsigned int* img = reinterpret_cast<unsigned int*>(pImage);
for (int y = 0; y < 640; ++y)
{
for (int x = 0; x < 480; x += 4)
{
// At the start of each iteration, read 3 ints. That's 12 bytes, enough to write exactly 4 pixels.
unsigned int i0 = *img;
unsigned int i1 = *(img+1);
unsigned int i2 = *(img+2);
img += 3;
// This probably won't make a difference, but keeping a reference to the found pixel saves some typing, and it may assist the compiler in avoiding aliasing.
ImgRGB& pix0 = imgRGB[y][x];
pix0.blue = i0 & 0xff;
pix0.green = (i0 >> 8) & 0xff;
pix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
ImgRGB& pix1 = imgRGB[y][x+1];
pix1.blue = (i0 >> 24) & 0xff;
pix1.green = i1 & 0xff;
pix1.red = (i0 >> 8) & 0xff;
imgBW[y][x+1] = i1 & 0xff;
ImgRGB& pix2 = imgRGB[y][x+2];
pix2.blue = (i1 >> 16) & 0xff;
pix2.green = (i1 >> 24) & 0xff;
pix2.red = i2 & 0xff;
imgBW[y][x+2] = (i1 >> 24) & 0xff;
ImgRGB& pix3 = imgRGB[y][x+3];
pix3.blue = (i2 >> 8) & 0xff;
pix3.green = (i2 >> 16) & 0xff;
pix3.red = (i2 >> 24) & 0xff;
imgBW[y][x+3] = (i2 >> 16) & 0xff;
}
}
it is also very likely that you're better off filling a temporary ImgRGB value, and then writing that entire struct to memory at once, meaning that the first block would look like this instead: (the following blocks would be similar, of course)
ImgRGB& pix0 = imgRGB[y][x];
ImgRGB tmpPix0;
tmpPix0.blue = i0 & 0xff;
tmpPix0.green = (i0 >> 8) & 0xff;
tmpPix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
pix0 = tmpPix0;
Depending on how clever the compiler is, this may cut down dramatically on the required number of reads.
Assuming the original code is naively compiled (which is probably unlikely, but will serve as an example), this will get you from 3 reads and 4 writes per pixel (read RGB channel, and write RGB + BW) to 3/4 reads per pixel and 2 writes. (one write for the RGB struct, and one for the BW value)
You could also accumulate the 4 writes to the BW image in a single int, and then write that in one go too, something like this:
bw |= (i0 >> 8) & 0xff;
bw |= (i1 & 0xff) << 8;
bw |= ((i1 >> 24) & 0xff) << 16;
bw |= ((i2 >> 16) & 0xff) << 24;
*(imgBW + y*480+x/4) = bw; // Assuming you can treat imgBW as an array of integers
This would cut down on the number of writes to 1.25 per pixel (1 per RGB struct, and 1 for every 4 BW values)
Again, the benefit will probably be a lot smaller (or even nonexistent), but it may be worth a shot.
Taking this a step further, the same could be done without too much trouble using the SSE instructions, allowing you to process 4 times as many values per iteration. (Assuming you're running on x86)
Of course, an important disclaimer here is that the above is nonportable. The reinterpret_cast is probably an academic point (it'll most likely work no matter what, especially if you can ensure that the original array is aligned on a 32-bit boundary, which will typically be the case for large allocations on all platforms)
A bigger issue is that the bit-twiddling depends on the CPU's endianness.
But in practice, this should work on x86. and with small changes, it should work on big-endian machines too. (modulo any bugs in my code, of course. I haven't tested or even compiled any of it ;))
But no matter how you solve it, you're going to see the biggest speed improvements from minimizing the number of reads and writes, and trying to accumulate as much data in the CPU's registers as possible. Read all you can in large chunks, like ints, reorder it in the registers (accumulate it into a number of ints, or write it into temporary instances of the RGB struct), and then write those combined value out to memory.
Depending on how much you know about low-level optimizations, it may be surprising to you, but temporary variables are fine, while direct memory to memory access can be slow (for example your pointer dereferencing assigned directly into the array). The problem with this is that you may get more memory accesses than necessary, and it's harder for the compiler to guarantee that no aliasing will occur, and so it may be unable to reorder or combine the memory accesses. You're generally better off writing as much as you can early on (top of the loop), doing as much as possible in temporaries (because the compiler can keep everything in registers), and then write everything out at the end. That also gives the compiler as much leeway as possible to wait for the initially slow reads.
Finally, adding a 4th dummy value to the RGB struct (so it has a total size of 32bit) will most likely help a lot too (because then writing such a struct is a single 32-bit write, which is simpler and more efficient than the current 24-bit)
When deciding how much to unroll the loop (you could do the above twice or more in each iteration), keep in mind how many registers your CPU has. Spilling out into the cache will probably hurt you as there are plenty of memory accesses already, but on the other hand, unroll as much as you can afford given the number of registers available (the above uses 3 registers for keeping the input data, and one to accumulate the BW values. It may need one or two more to compute the necessary addresses, so on x86, doubling the above might be pushing it a bit (you have 8 registers total, and some of them have special meanings). On the other hand, modern CPU's do a lot to compensate for register pressure, by using a much larger number of registers behind the scenes, so further unrolling might still be a total performance win.
As always, measure measure measure. It's impossible to say what's fast and what isn't until you've tested it.
Another general point to keep in mind is that data dependencies are bad. This won't be a big deal as long as you're only dealing with integral values, but it still inhibits instruction reordering, and superscalar execution.
In the above, I've tried to keep dependency chains as short as possible. Rather than continually incrementing the same pointer (which means that each increment is dependant on the previous one), adding a different offset to the same base address means that every address can be computed independently, again giving more freedom to the compiler to reorder and reschedule instructions.