+6  A: 

The code below should be pretty fast. It copies 4 pixels in each iteration, using only 32-bit read/write instructions. The source and destination pointers should be aligned to 32 bits.

uint32_t *src = ...;
uint32_t *dst = ...;

for (int i=0; i<num_pixels; i+=4) {
    uint32_t sa = src[0];
    uint32_t sb = src[1];
    uint32_t sc = src[2];

    dst[i+0] = sa;
    dst[i+1] = (sa>>24) | (sb<<8);
    dst[i+2] = (sb>>16) | (sc<<16);
    dst[i+3] = sc>>8;

    src += 3;
}

Edit:

Here is a way to do this using the SSSE3 instructions PSHUFB and PALIGNR. The code is written using compiler intrinsics, but it shouldn't be hard to translate to assembly if needed. It copies 16 pixels in each iteration. The source and destination pointers Must be aligned to 16 bytes, or it will fault. If they aren't aligned, you can make it work by replacing _mm_load_si128 with _mm_loadu_si128 and _mm_store_si128 with _mm_storeu_si128, but this will be slower.

#include <emmintrin.h>
#include <tmmintrin.h>

__m128i *src = ...;
__m128i *dst = ...;
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);

for (int i=0; i<num_pixels; i+=16) {
    __m128i sa = _mm_load_si128(src);
    __m128i sb = _mm_load_si128(src+1);
    __m128i sc = _mm_load_si128(src+2);

    __m128i val = _mm_shuffle_epi8(sa, mask);
    _mm_store_si128(dst, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
    _mm_store_si128(dst+1, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
    _mm_store_si128(dst+2, val);
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
    _mm_store_si128(dst+3, val);

    src += 3;
    dst += 4;
}

SSSE3 (not to be confused with SSE3) will require a relatively new processor: Core 2 or newer, and I believe AMD doesn't support it yet. Performing this with SSE2 instructions only will take a lot more operations, and may not be worth it.

interjay
Wow - thanks interjay! It's great to know that I was mistaken, and SIMD operations are available that can do what I need. And that SSSE3 sample is fantastic! I have full control of the platform this code's running on, and can restrict the the hardware choices to SSSE3-capable proc's.
Clippy
+1  A: 

The different input/output sizes are not a barrier to using simd, just a speed bump. You would need to chunk the data so that you read and write in full simd words (16 bytes).

In this case, you would read 3 SIMD words (48 bytes == 16 rgb pixels), do the expansion, then write 4 SIMD words.

I'm just saying you can use SIMD, I'm not saying you should. The middle bit, the expansion, is still tricky since you have non-uniform shift sizes in different parts of the word.

Mark Borgerding
Thanks Mark - It's great to know that I was mistaken, and SIMD operations are available that can do what I needed. It may not be faster, but it's worth me peeking at to make sure. :)
Clippy