I'm working on something where I'd like the option of sending the output to a video overlay. Some overlays support RGB565; if so, sweet, just copy the data across.
If not, I have to copy the data across with a conversion, a frame buffer at a time. I'm going to try a few things, but I thought this might be one of those problems that optimisers would be keen to have a go at for a bit of a challenge.
There are a variety of commonly supported YUV formats. The easiest would be the planar ones: a plane of Y followed by either an interleaved UV plane or separate U and V planes.
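To make the two planar layouts concrete, here is a rough C sketch of where each plane would start inside one frame buffer. It assumes 2x2 chroma subsampling (as in I420 or NV12), even dimensions, and no row padding, none of which the post states; real overlays often report their own pitches and offsets. The struct and function names are made up for illustration.

```c
#include <stddef.h>

/* Byte offsets of each plane inside one frame buffer.
   Hypothetical names, not taken from Xv or any driver. */
typedef struct {
    size_t y_offset;  /* start of the Y plane                          */
    size_t u_offset;  /* start of the U plane, or of interleaved UV    */
    size_t v_offset;  /* start of the V plane, or first V byte if UV   */
    size_t total;     /* bytes needed for the whole frame              */
} yuv420_layout;

/* Assumes even width and height, 2x2 chroma subsampling, no row padding. */
static yuv420_layout yuv420_planes(size_t width, size_t height,
                                   int interleaved_uv)
{
    yuv420_layout l;
    size_t y_size = width * height;              /* one byte per pixel   */
    size_t c_size = (width / 2) * (height / 2);  /* one byte per 2x2 box */

    l.y_offset = 0;
    l.u_offset = y_size;
    /* Interleaved: U,V,U,V,... so V starts one byte after U.
       Separate planes: V starts after the whole U plane. */
    l.v_offset = interleaved_uv ? y_size + 1 : y_size + c_size;
    l.total    = y_size + 2 * c_size;
    return l;
}
```

Either way the Y plane comes first and is written the same, so a luma converter can be shared between the two layouts.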
Using Linux / xv, but at the level I'm dealing with it's just bytes and an x86.
I'm going to focus on speed at the cost of quality, but there are potentially hundreds of different paths to try out. There's a balance in there somewhere.
I looked at MMX, but I'm not sure there is anything useful there. Nothing strikes me as particularly suited to the task, and it takes a lot of shuffling to get things into the right place in registers.
I'm trying a crude version with Y = G*0.5 + R*0.25 + B*notmuch. The U and V channels are even less of a concern quality-wise; you can get away with murder on those.
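As a reference to check faster versions against, the crude formula can be written in plain C. This is only a sketch: it reads "notmuch" as the top bit of blue, which is what falls out of the shift-and-add approach, and the function names are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Crude luma from one RGB565 pixel: Y = G*0.5 + R*0.25 + B*notmuch.
   The weights are applied to the raw 5/6/5-bit fields with shifts only. */
static uint8_t rgb565_to_y(uint16_t p)
{
    unsigned r = (p >> 11) & 0x1F;  /* 5-bit red   */
    unsigned g = (p >> 5)  & 0x3F;  /* 6-bit green */
    unsigned b =  p        & 0x1F;  /* 5-bit blue  */

    /* g*2 is half of 8-bit green (g*4), r*2 is a quarter of 8-bit
       red (r*8), and b>>4 keeps only the top bit of blue. */
    return (uint8_t)(g * 2 + r * 2 + (b >> 4));
}

/* Convert count pixels into a Y plane, one pixel at a time. */
static void rgb565_to_y_plane(const uint16_t *src, uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = rgb565_to_y(src[i]);
}
```

With these weights pure white (0xFFFF) comes out as 189 rather than 255, so the picture will be somewhat darker than a proper conversion would give. That is part of the speed-for-quality trade.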
A simple loop, one pixel at a time:
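next_pixel:
movzx eax,word ptr [esi]   ; load one RGB565 pixel, zero-extended
add esi,2
shr eax,3                  ; ah = R (5 bits), al = G*4 + (B>>3)
shr al,1                   ; al = G*2 + (B>>4): half of green plus a touch of blue
add ah,ah                  ; ah = R*2: a quarter of red
add al,ah                  ; al = Y
mov [edi],al
add edi,1
dec count
jnz next_pixel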
Of course, every instruction depends on the one before, and word reads aren't the best, so interleaving two pixels might gain a bit:
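next_pair:
mov eax,[esi]      ; load two RGB565 pixels
add esi,4
mov ebx,eax
shr eax,3          ; ah = R0 plus three stray bits from pixel 1
shr ebx,19         ; bh = R1, bl = G1*4 + (B1>>3)
and ah,1Fh         ; drop the stray pixel 1 bits
shr al,1           ; al = G0*2 + (B0>>4)
shr bl,1           ; bl = G1*2 + (B1>>4)
add ah,ah
add bh,bh
add al,ah          ; al = Y0
add bl,bh          ; bl = Y1
mov ah,bl
mov [edi],ax       ; store Y0 then Y1
add edi,2
dec count          ; count is in pixel pairs here
jnz next_pair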
It would be quite easy to do that with 4 at a time, maybe for a benefit.
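One way the four-at-a-time idea might look, sketched in C rather than assembly: treat each 32-bit load as two 16-bit lanes and do the shifts and masks on both lanes at once, which avoids the byte-register juggling. This is an untimed sketch, not a claim that it beats the loops above. It assumes a little-endian host such as x86, and the function names are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Two RGB565 pixels in one 32-bit word, one per 16-bit lane.
   Returns Y0 in bits 0-7 and Y1 in bits 16-23. */
static uint32_t luma_pair(uint32_t two_pixels)
{
    uint32_t gb = (two_pixels >> 4)  & 0x007F007Fu;  /* G*2 + (B>>4) per lane */
    uint32_t r2 = (two_pixels >> 10) & 0x003E003Eu;  /* R*2 per lane          */
    return gb + r2;  /* at most 189 per lane, so nothing carries across lanes */
}

/* Convert count pixels, four per iteration, with a scalar tail.
   Assumes a little-endian host. */
static void rgb565_to_y_x4(const uint16_t *src, uint8_t *dst, size_t count)
{
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        uint32_t a, b, out;
        memcpy(&a, src + i, 4);       /* pixels 0 and 1 */
        memcpy(&b, src + i + 2, 4);   /* pixels 2 and 3 */
        a = luma_pair(a);
        b = luma_pair(b);
        /* Pack the four results into consecutive bytes Y0,Y1,Y2,Y3. */
        out = (a & 0xFFu) | ((a >> 8) & 0xFF00u)
            | ((b & 0xFFu) << 16) | ((b >> 16) << 24);
        memcpy(dst + i, &out, 4);     /* one 32-bit store */
    }
    for (; i < count; i++) {          /* leftover pixels, one at a time */
        uint16_t p = src[i];
        dst[i] = (uint8_t)(((p >> 4) & 0x7F) + ((p >> 10) & 0x3E));
    }
}
```

The memcpy calls are there to keep the loads and stores free of alignment and aliasing problems; compilers generally turn fixed-size copies like these into plain moves.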
Can anyone come up with anything faster, better?
An interesting side point to this is whether or not a decent compiler can produce similar code.