Can someone explain to me how I convert a 32-bit floating point value to a 16-bit floating point value?
(s = sign, e = exponent, and m = mantissa)
If 32-bit float is 1s7e24m
And 16-bit float is 1s5e10m
Then is it as simple as doing this?
int fltInt32;
short fltInt16;
memcpy( &fltInt32, &flt, sizeof( float ) );
fltInt16 = (fltInt32 & 0x00FFFFFF) >> 14;
fltInt16 |= ((fltInt32 & 0x7f000000) >> 26) << 10;
fltInt16 |= ((fltInt32 & 0x80000000) >> 16);
I'm assuming it ISN'T that simple ... so can anyone tell me what you DO need to do?
Edit: I can see I've got my exponent shift wrong ... so would THIS be better?
fltInt16 = (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x7c000000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;
I'm hoping this is correct. Apologies if I'm missing something obvious that has already been said. It's almost midnight on a Friday night ... so I'm not "entirely" sober ;)
Edit 2: Oops. Buggered it again. I want to lose the top 3 bits, not the lower ones! So how about this:
fltInt16 = (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x0f800000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;
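For comparison, here is a minimal sketch of what the conversion seems to need beyond masking and shifting. It assumes the 32-bit float is IEEE 754 binary32, which is laid out as 1s8e23m with an exponent bias of 127 (not 1s7e24m), and that the 16-bit target is IEEE 754 binary16 (1s5e10m, bias 15). Under those assumptions the exponent has to be re-biased rather than truncated, and values that don't fit (overflow, underflow, NaN) need their own handling. The function name `float_to_half` is hypothetical, and rounding to nearest even is one reasonable choice, not the only one.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: convert a binary32 float to binary16 bits, rounding to nearest even.
   Assumes IEEE 754 layouts: binary32 = 1s8e23m (bias 127), binary16 = 1s5e10m (bias 15). */
uint16_t float_to_half(float flt)
{
    uint32_t bits;
    memcpy(&bits, &flt, sizeof bits);            /* reinterpret without aliasing issues */

    uint32_t sign  = (bits >> 16) & 0x8000u;     /* sign moved to bit 15 */
    uint32_t exp32 = (bits >> 23) & 0xFFu;       /* 8-bit biased exponent */
    uint32_t man32 = bits & 0x007FFFFFu;         /* 23-bit mantissa */

    if (exp32 == 0xFFu) {
        /* Inf stays Inf; NaN keeps a nonzero mantissa so it stays NaN */
        uint32_t nan_bits = man32 ? (0x0200u | (man32 >> 13)) : 0u;
        return (uint16_t)(sign | 0x7C00u | nan_bits);
    }

    int32_t exp16 = (int32_t)exp32 - 127 + 15;   /* re-bias the exponent */

    if (exp16 >= 0x1F) {
        return (uint16_t)(sign | 0x7C00u);       /* too large: overflow to Inf */
    }

    if (exp16 <= 0) {
        /* Result is a half-precision subnormal or zero */
        if (exp16 < -10) {
            return (uint16_t)sign;               /* too small: signed zero */
        }
        man32 |= 0x00800000u;                    /* restore the implicit leading 1 */
        uint32_t shift   = (uint32_t)(14 - exp16);   /* 14..24 */
        uint32_t half    = man32 >> shift;
        uint32_t rem     = man32 & ((1u << shift) - 1u);
        uint32_t halfway = 1u << (shift - 1u);
        if (rem > halfway || (rem == halfway && (half & 1u))) {
            half++;                              /* round to nearest, ties to even */
        }
        return (uint16_t)(sign | half);
    }

    /* Normal case: keep the top 10 mantissa bits and round on the 13 dropped bits */
    uint32_t half = ((uint32_t)exp16 << 10) | (man32 >> 13);
    uint32_t rem  = man32 & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) {
        half++;                                  /* a carry into the exponent is intended */
    }
    return (uint16_t)(sign | half);
}
```

With this sketch, `float_to_half(1.0f)` gives `0x3C00` and anything above the half-precision range (for example `1.0e6f`) becomes `0x7C00`, i.e. infinity. The plain mask-and-shift versions above cannot produce those results because they never subtract the difference between the two biases.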