views:

1699

answers:

4

I need a cross-platform library/algorithm that will convert between 32-bit and 16-bit floating point numbers. I don't need to perform math with the 16-bit numbers; I just need to decrease the size of the 32-bit floats so they can be sent over the network. I am working in C++.

I understand how much precision I would be losing, but that's OK for my application.

The IEEE 16-bit format would be great.

+5  A: 

frexp extracts the significand and exponent from normal floats or doubles -- then you need to decide what to do with exponents that are too large to fit in a half-precision float (saturate...?), adjust accordingly, and put the half-precision number together. This article has C source code to show you how to perform the conversion.
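A sketch of the frexp approach (illustrative code of my own, not from the linked article; it assumes the exponent stays in half's normal range, with no subnormal, NaN, or infinity handling beyond saturating to zero/infinity):

```cpp
#include <cmath>
#include <cstdint>

// Pack a float into IEEE half-precision (1 sign, 5 exponent, 10 mantissa bits)
// using frexp. Name floatToHalf is illustrative.
uint16_t floatToHalf(float f) {
    if (f == 0.0f) return 0;
    uint16_t sign = (f < 0.0f) ? 0x8000 : 0;
    int e;
    float m = std::frexp(std::fabs(f), &e);   // |f| = m * 2^e, m in [0.5, 1)
    // half stores 1.xxxxxxxxxx * 2^(E-15); m*2^e == (2m)*2^(e-1)
    int E = e - 1 + 15;                       // biased exponent
    if (E <= 0)  return sign;                 // underflow -> signed zero
    if (E >= 31) return sign | 0x7C00;        // overflow  -> infinity
    uint16_t mant = (uint16_t)((2.0f * m - 1.0f) * 1024.0f + 0.5f); // 10 bits, rounded
    if (mant == 1024) {                       // rounding carried into the exponent
        mant = 0;
        if (++E >= 31) return sign | 0x7C00;
    }
    return sign | (uint16_t)(E << 10) | mant;
}
```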

Alex Martelli
Actually, the values I'm sending have very limited range: (-1000, 1000) so the exponent isn't that big of an issue.
Matt Fichman
@Matt, if you **know** the exponent will never under/over flow, then your job's easier by that much!-)
Alex Martelli
@Alex, indeed, it does make it easier! Thanks.
Matt Fichman
+4  A: 

Given your needs (-1000, 1000), perhaps it would be better to use a fixed-point representation.

#include <cmath>  // for round()

// change 20000 to SHRT_MAX if you don't mind whole numbers
// being turned into fractional ones
const int compact_range = 20000;

short compactFloat(double input) {
    return (short)round(input * compact_range / 1000);
}
double expandToFloat(short input) {
    return ((double)input) * 1000 / compact_range;
}

This will give you accuracy to the nearest 0.05. If you change 20000 to SHRT_MAX you'll get a bit more accuracy, but some whole numbers will end up as decimals on the other end.

Artelius
+1 This will get you *much more* accuracy than a 16 bit float in almost every case, and with less math and no special cases. A 16-bit IEEE float will only have 10 bits of accuracy and crams half of its possible values in the range (-1, 1)
Shmoopty
It depends on the distribution in the range [-1000, 1000]. If most numbers are in fact in the range [-1,1], then the accuracy of 16 bits floats is on average better.
MSalters
This would be better with SHRT_MAX and 1024 as the scale factor, giving a 10.6-bit fixed-point representation, and all integers would be exactly representable. The precision would be 1/2^6 = 0.015625, which is far better than 0.05, and the power-of-two scale factor is easy to optimise to a bit-shift (the compiler is likely to do it for you).
Clifford
Sorry, that should have been 11.5 (forgot the sign bit!). Then the precision is 1/2^5 = 0.03125; still not bad for something that will also perform better.
Clifford
@Clifford: Totally right. I have no idea why I didn't think of the 1024 thing.
Artelius
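A sketch of the power-of-two variant from this comment thread, using 5 fractional bits so every integer in the range is exactly representable (function names are my own):

```cpp
#include <cmath>
#include <cstdint>

// s10.5 fixed point: 1 sign bit, 10 integer bits, 5 fractional bits.
// Covers roughly (-1024, 1024) with 1/32 = 0.03125 precision.
int16_t packFixed(double input) {
    return (int16_t)std::lround(input * 32.0);  // scale by 2^5; compiles to a shift
}
double unpackFixed(int16_t packed) {
    return packed / 32.0;                       // exact inverse of the scaling
}
```

Because the scale is a power of two, every whole number in range round-trips exactly, which is the advantage over the 20000/1000 scale above.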
@Matt, is it possible to send your normalised values using a different format to the position vectors? Consider using an appropriate fixed-point scheme for each of them.
Artelius
+3  A: 

If you're sending a stream of information across, you could probably do better than this, especially if everything is in a consistent range, as your application seems to have.

Send a small header that just consists of a float32 minimum and maximum; then you can send your values across as 16-bit interpolation factors between the two. Since you also say that precision isn't much of an issue, you could even send 8 bits at a time.

Your value would be something like, at reconstruction time:

float t = _t / numeric_limits<unsigned short>::max();  // With casting, naturally ;)
float val = h.min + t * (h.max - h.min);
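A complete round-trip of this scheme might look like the following sketch (the Header struct and function names are my own illustration; values are assumed to already lie in [min, max]):

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Sent once per stream, as two float32 values.
struct Header { float min, max; };

// Map a value in [h.min, h.max] onto the full uint16_t range.
uint16_t quantize(const Header& h, float val) {
    float t = (val - h.min) / (h.max - h.min);  // normalize to [0, 1]
    return (uint16_t)std::lround(t * std::numeric_limits<uint16_t>::max());
}

// Reconstruct an approximation of the original value.
float dequantize(const Header& h, uint16_t _t) {
    float t = (float)_t / std::numeric_limits<uint16_t>::max();
    return h.min + t * (h.max - h.min);
}
```

The worst-case error is half a step, i.e. (max - min) / (2 * 65535), so a tight per-stream [min, max] directly buys precision.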

Hope that helps.

-Tom

tsalter
This is a great solution, especially for normalized vector/quaternion values which you know will always be in the range (-1, 1).
Matt Fichman
+1 for using `numeric_limits`.
xtofl
A: 

Complete conversion from single precision to half precision. This is a direct copy from my SSE version, so it's branchless. It makes use of the fact that in GCC `(-true == ~0)`; that may be true for Visual Studio too, but I don't have a copy to check.

    class Float16Compressor
    {
        union Bits
        {
            float f;
            int32_t si;
            uint32_t ui;
        };

        static int const shift = 13;
        static int const shiftSign = 16;

        static int32_t const infN = 0x7F800000; // flt32 infinity
        static int32_t const maxN = 0x477FE000; // max flt16 normal as a flt32
        static int32_t const minN = 0x38800000; // min flt16 normal as a flt32
        static int32_t const signN = 0x80000000; // flt32 sign bit

        static int32_t const infC = infN >> shift;
        static int32_t const nanN = (infC + 1) << shift; // minimum flt16 nan as a flt32
        static int32_t const maxC = maxN >> shift;
        static int32_t const minC = minN >> shift;
        static int32_t const signC = signN >> shiftSign; // flt16 sign bit

        static int32_t const mulN = 0x52000000; // (1 << 23) / minN
        static int32_t const mulC = 0x33800000; // minN / (1 << (23 - shift))

        static int32_t const subC = 0x003FF; // max flt32 subnormal down shifted
        static int32_t const norC = 0x00400; // min flt32 normal down shifted

        static int32_t const maxD = infC - maxC - 1;
        static int32_t const minD = minC - subC - 1;

    public:

        static uint16_t compress(float value)
        {
            Bits v, s;
            v.f = value;
            uint32_t sign = v.si & signN;
            v.si ^= sign;
            sign >>= shiftSign; // logical shift
            s.si = mulN;
            s.si = s.f * v.f; // correct subnormals
            v.si ^= (s.si ^ v.si) & -(minN > v.si);
            v.si ^= (infN ^ v.si) & -((infN > v.si) & (v.si > maxN));
            v.si ^= (nanN ^ v.si) & -((nanN > v.si) & (v.si > infN));
            v.ui >>= shift; // logical shift
            v.si ^= ((v.si - maxD) ^ v.si) & -(v.si > maxC);
            v.si ^= ((v.si - minD) ^ v.si) & -(v.si > subC);
            return v.ui | sign;
        }

        static float decompress(uint16_t value)
        {
            Bits v;
            v.ui = value;
            int32_t sign = v.si & signC;
            v.si ^= sign;
            sign <<= shiftSign;
            v.si ^= ((v.si + minD) ^ v.si) & -(v.si > subC);
            v.si ^= ((v.si + maxD) ^ v.si) & -(v.si > maxC);
            Bits s;
            s.si = mulC;
            s.f *= v.si;
            int32_t mask = -(norC > v.si);
            v.si <<= shift;
            v.si ^= (s.si ^ v.si) & mask;
            v.si |= sign;
            return v.f;
        }
    };
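The repeated `v.si ^= (y ^ v.si) & -(cond)` lines above rely on a branchless select idiom: a comparison yields 0 or 1, so negating it gives 0 or all-ones in two's complement, which masks the XOR swap on or off. A minimal standalone illustration (function name my own):

```cpp
#include <cstdint>

// Branchless equivalent of (cond ? y : x), the building block of
// Float16Compressor above. -cond is 0x00000000 or 0xFFFFFFFF.
int32_t select(int32_t x, int32_t y, bool cond) {
    x ^= (y ^ x) & -(int32_t)cond;
    return x;
}
```

With no branch to mispredict, the same pattern maps directly onto SSE compare-and-mask instructions, which is why the scalar and SSE versions mirror each other.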

So that's a lot to take in, but it handles all subnormal values, both infinities, quiet NaNs, signaling NaNs, and negative zero. Of course, full IEEE support isn't always needed. So, for compressing generic floats:

    class FloatCompressor
    {
        union Bits
        {
            float f;
            int32_t si;
            uint32_t ui;
        };

        bool hasNegatives;
        bool noLoss;
        int32_t _maxF;
        int32_t _minF;
        int32_t _epsF;
        int32_t _maxC;
        int32_t _zeroC;
        int32_t _pDelta;
        int32_t _nDelta;
        int _shift;

        static int32_t const signF = 0x80000000;
        static int32_t const absF = ~signF;

    public:

        FloatCompressor(float min, float epsilon, float max, int precision)
        {
            // legal values
            // min <= 0 < epsilon < max
            // 0 <= precision <= 23
            _shift = 23 - precision;
            Bits v;
            v.f = min;
            _minF = v.si;
            v.f = epsilon;
            _epsF = v.si;
            v.f = max;
            _maxF = v.si;
            hasNegatives = _minF < 0;
            noLoss = _shift == 0;
            int32_t pepsU, nepsU;
            if(noLoss) {
                nepsU = _epsF;
                pepsU = _epsF ^ signF;
                _maxC = _maxF ^ signF;
                _zeroC = signF;
            } else {
                nepsU = uint32_t(_epsF ^ signF) >> _shift;
                pepsU = uint32_t(_epsF) >> _shift;
                _maxC = uint32_t(_maxF) >> _shift;
                _zeroC = 0;
            }
            _pDelta = pepsU - _zeroC - 1;
            _nDelta = nepsU - _maxC - 1;
        }

        float clamp(float value)
        {
            Bits v;
            v.f = value;
            int32_t max = _maxF;
            if(hasNegatives)
                max ^= (_minF ^ _maxF) & -(0 > v.si);
            v.si ^= (max ^ v.si) & -(v.si > max);
            v.si &= -(_epsF <= (v.si & absF));
            return v.f;
        }

        uint32_t compress(float value)
        {
            Bits v;
            v.f = clamp(value);
            if(noLoss)
                v.si ^= signF;
            else
                v.ui >>= _shift;
            if(hasNegatives)
                v.si ^= ((v.si - _nDelta) ^ v.si) & -(v.si > _maxC);
            v.si ^= ((v.si - _pDelta) ^ v.si) & -(v.si > _zeroC);
            if(noLoss)
                v.si ^= signF;
            return v.ui;
        }

        float decompress(uint32_t value)
        {
            Bits v;
            v.ui = value;
            if(noLoss)
                v.si ^= signF;
            v.si ^= ((v.si + _pDelta) ^ v.si) & -(v.si > _zeroC);
            if(hasNegatives)
                v.si ^= ((v.si + _nDelta) ^ v.si) & -(v.si > _maxC);
            if(noLoss)
                v.si ^= signF;
            else
                v.si <<= _shift;
            return v.f;
        }

    };

This forces all values into the accepted range; there is no support for NaNs, infinities, or negative zero. Epsilon is the smallest allowable value in the range, and precision is how many bits of precision to retain from the float. While there are a lot of branches above, they are all static and will be predicted correctly by the CPU's branch predictor.

Of course, if your values don't require logarithmic resolution approaching zero, then linearizing them to a fixed-point format is much faster, as was already mentioned.

I use the FloatCompressor (SSE version) in a graphics library to reduce the size of linear float color values in memory. Compressed floats have the added advantage of enabling small look-up tables for time-consuming functions like gamma correction or transcendentals. Compressing linear sRGB values reduces to a max of 12 bits, a max value of 3011, which is a great look-up table size for to/from sRGB conversion.

Phernost