I'm working on an embedded device that does not support unaligned memory accesses.
For a video decoder I have to process pixels (one byte per pixel) in 8x8 pixel blocks. The device has some SIMD processing capabilities that allow me to work on 4 bytes in parallel.
The problem is that the 8x8 pixel blocks aren't guaranteed to start on an aligned address, and the functions need to read/write up to three of these 8x8 blocks.
How would you approach this if you want very good performance? After a bit of thinking I came up with the following three ideas:
1. Do all memory accesses as bytes. This is the easiest way to do it, but it is slow and does not work well with the SIMD capabilities (it's what I'm currently doing in my reference C code).
2. Write four copy functions (one for each alignment case) that load the pixel data via two 32-bit reads, shift the bits into the correct position and write the data to some aligned chunk of scratch memory. The video processing functions can then use 32-bit accesses and SIMD. Drawback: the CPU will have no chance to hide the memory latency behind the processing.
3. Same idea as above, but instead of writing the pixels to scratch memory, do the video processing in place. This may be the fastest way, but the number of functions that I have to write for this approach is high (around 60, I guess).
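To make the second idea concrete, here is a rough C sketch of the copy step. It is only an illustration, not the assembler version I would end up with. The function name `copy_block8x8_to_scratch` and the `stride` parameter are made up for this example. It assumes a little-endian CPU, a pixel buffer that may legally be read as 32-bit words, and that reading up to three bytes past the end of a row (still inside the same aligned word) is harmless. A real implementation would probably split this into one variant per alignment so the shift amount becomes a constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy one 8x8 block of bytes from an arbitrarily aligned source into an
 * aligned scratch buffer (2 words per row), using only aligned 32-bit loads.
 * Assumption: little-endian byte order. */
static void copy_block8x8_to_scratch(const uint8_t *src, int stride,
                                     uint32_t scratch[16])
{
    assert(src != NULL && scratch != NULL);

    for (int y = 0; y < 8; ++y) {
        uintptr_t addr = (uintptr_t)(src + (ptrdiff_t)y * stride);
        /* Round the address down to the containing aligned word. */
        const uint32_t *w = (const uint32_t *)(addr & ~(uintptr_t)3);
        /* Misalignment in bits: 0, 8, 16 or 24. */
        unsigned s = (unsigned)(addr & 3u) * 8u;
        uint32_t *d = &scratch[y * 2];

        if (s == 0) {
            /* Already aligned: two plain word copies. */
            d[0] = w[0];
            d[1] = w[1];
        } else {
            /* Misaligned: 8 pixels span three aligned words. Combine
             * neighbouring words with shifts (a shift by 32 is avoided
             * because s is never 0 in this branch). */
            uint32_t w0 = w[0], w1 = w[1], w2 = w[2];
            d[0] = (w0 >> s) | (w1 << (32u - s));
            d[1] = (w1 >> s) | (w2 << (32u - s));
        }
    }
}
```

The processing functions would then work on `scratch` with aligned 32-bit accesses, and a mirrored function would be needed to write results back to the unaligned destination.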
Btw: I will have to write all functions in assembler because the compiler generates horrible code when it comes to the SIMD extension.
Which road would you take, or do you have another idea how to approach this?