



Hi all. Here are the goals I'm trying to achieve:

  • I need to pack 32 bit IEEE floats into 30 bits.
  • I want to do this by decreasing the size of mantissa by 2 bits.
  • The operation itself should be as fast as possible.
  • I'm aware that some precision will be lost, and this is acceptable.
  • It would be an advantage if this operation did not ruin special cases like SNaN, QNaN, infinities, etc. But I'm ready to sacrifice this for speed.

I guess this question consists of two parts:

1) Can I just simply clear the least significant bits of mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:

float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;

2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?

Thanks in advance

+9  A: 

You actually violate the strict aliasing rules (section 3.10 of the C++ standard) with these reinterpret casts. This will probably blow up in your face when you turn on the compiler optimizations.

C++ standard, section 3.10 paragraph 15 says:

If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behavior is undefined

  • the dynamic type of the object,
  • a cv-qualified version of the dynamic type of the object,
  • a type similar to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
  • a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
  • a char or unsigned char type.

Specifically, 3.10/15 doesn't allow us to access a float object via an lvalue of type unsigned int. I actually got bitten myself by this. The program I wrote stopped working after turning on optimizations. Apparently, GCC didn't expect an lvalue of type float to alias an lvalue of type int which is a fair assumption by 3.10/15. The instructions got shuffled around by the optimizer under the as-if rule exploiting 3.10/15 and it stopped working.

Under the following assumptions

  • float really corresponds to a 32bit IEEE-float,
  • sizeof(float)==sizeof(int)
  • unsigned int has no padding bits or trap representations

you should be able to do it like this:

#include <cstring>  // std::memcpy

/// returns a 30 bit number
unsigned int pack_float(float x) {
    unsigned r;
    std::memcpy(&r,&x,sizeof r);
    return r >> 2;
}

float unpack_float(unsigned int x) {
    x <<= 2;
    float r;
    std::memcpy(&r,&x,sizeof r);
    return r;
}

This doesn't suffer from the "3.10-violation" and is typically very fast. At least GCC treats memcpy as an intrinsic function. In case you don't need the functions to work with NaNs, infinities or numbers with extremely high magnitude you can even improve accuracy by replacing "r >> 2" with "(r+1) >> 2":

unsigned int pack_float(float x) {
    unsigned r;
    std::memcpy(&r,&x,sizeof r);
    return (r+1) >> 2;
}

This works even if it changes the exponent due to a mantissa overflow because the IEEE-754 coding maps consecutive floating point values to consecutive integers (ignoring +/- zero). This mapping actually approximates a logarithm quite well.
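
To make the "consecutive integers" point concrete, here is a minimal sketch, assuming a 32-bit IEEE-754 float and a C++11 compiler. The helper names (float_bits, bits_to_float, next_bits_is_next_float, round_trip_rounded) are made up for illustration and are not part of the answer above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Copy the bit pattern of a float into an integer (no aliasing violation).
std::uint32_t float_bits(float x) {
    std::uint32_t r;
    std::memcpy(&r, &x, sizeof r);
    return r;
}

// Inverse of float_bits.
float bits_to_float(std::uint32_t bits) {
    float r;
    std::memcpy(&r, &bits, sizeof r);
    return r;
}

// For a positive finite float, adding 1 to the bit pattern should give
// the next representable float in the direction of +infinity.
bool next_bits_is_next_float(float x) {
    assert(x > 0.0f && std::isfinite(x));
    const float up = bits_to_float(float_bits(x) + 1);
    return up == std::nextafter(x, std::numeric_limits<float>::infinity());
}

// Pack with the (r+1) >> 2 variant and unpack again in one step.
float round_trip_rounded(float x) {
    return bits_to_float(((float_bits(x) + 1) >> 2) << 2);
}
```

With these helpers, the largest float below 2.0 (bit pattern 0x3FFFFFFF) round-trips to exactly 2.0: the +1 carries out of the mantissa into the exponent, and the result is still the neighbouring value rather than garbage.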

A memcpy when a simple cast would be sufficient? Can you explain that rule some more? Casts like these have never caused any problems for me.
@Toad: On most platforms with a good compiler, you won't actually get a call to `memcpy`; many compilers know about `memcpy` and will replace it with an inline sequence when the size is known at compile time and is small. That inline sequence is then optimized away (almost) entirely in many circumstances.
Stephen Canon
@Toad: I updated the answer to include a little more on strict aliasing.
+8  A: 

Blindly dropping the 2 LSBs of the float may fail for a small number of unusual NaN encodings.

A NaN is encoded as exponent=255, mantissa!=0, but IEEE-754 doesn't say anything about which mantissa values should be used. If the mantissa value is <= 3, you could turn a NaN into an infinity!
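
A small sketch of that failure case, assuming the 32-bit IEEE-754 layout (1 sign bit, 8 exponent bits, 23 mantissa bits). It works on raw bit patterns only, because some hardware may alter a signaling NaN when it passes through a floating-point register. The function names are hypothetical.

```cpp
#include <cstdint>

// Drop the two least significant mantissa bits (pack, then unpack).
std::uint32_t truncate_bits(std::uint32_t bits) {
    return (bits >> 2) << 2;
}

// NaN: exponent field all ones, mantissa field non-zero.
bool is_nan_bits(std::uint32_t bits) {
    return (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0;
}

// Infinity: exponent field all ones, mantissa field zero (either sign).
bool is_inf_bits(std::uint32_t bits) {
    return (bits & 0x7FFFFFFFu) == 0x7F800000u;
}
```

For example, 0x7F800001 is a NaN, but truncate_bits(0x7F800001) is 0x7F800000, which is +infinity. A NaN with a higher mantissa bit set, such as 0x7FC00000, survives the truncation.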

This is true, but since the only nan spontaneously produced on IEEE machines is the real-indefinite-qnan, it can only burn you if you have explicit NaN creation in your application. So I say, go for it...
Dave Dunn
@Dave Dunn: IEEE754 doesn't specify how to encode qNaN/sNaN. It specifies what encodings are NaN, and *recommends* a specific way to encode whether a NaN is quiet or signaling, but IEEE754-compliant hardware could choose to encode that property (or other information) in any number of ways in the significand field of the NaN.
Stephen Canon
@Stephen: actually, according to IEEE754, a qNaN always has the most significant bit of its mantissa set to 1, so dropping bits can only fail for an sNaN.
@Smilediver: That's incorrect. What you describe is the *recommended* encoding for `qNaN`, but it's not *required*. An IEEE-754 implementation is free to choose a different encoding.
Stephen Canon
The relevant clause in the 2008 standard is **6.2.1 NaN encodings in binary formats**: "A quiet NaN bit string *should* be encoded with…" (emphasis mine). Required behavior in the IEEE-754 document is always stated with *shall*; *should* is recommended practice.
Stephen Canon
I stand corrected. But the interesting thing is the last sentence: "For binary formats, the payload is encoded in the p−2 least significant bits of the trailing significand field.". It doesn't leave any choice for qNaN other than the most significant bit of the mantissa. I guess the real question is whether there are any CPUs that actually use other schemes to encode qNaN.
+2  A: 

You should encapsulate it in a struct, so that you don't accidentally mix the usage of the tagged float with regular "unsigned int":

#include <iostream>
using namespace std;

struct TypedFloat {
        union {
            unsigned int raw : 32;
            struct {
                unsigned int num  : 30;
                unsigned int type : 2;
            };
        };

        TypedFloat(unsigned int type=0) : num(0), type(type) {}

        operator float() const {
            unsigned int tmp = num << 2;
            return reinterpret_cast<float&>(tmp);
        }
        void operator=(float newnum) {
            num = reinterpret_cast<int&>(newnum) >> 2;
        }
        unsigned int getType() const {
            return type;
        }
        void setType(unsigned int type) {
            this->type = type;
        }
};

int main() { 
    const unsigned int TYPE_A = 1;
    TypedFloat a(TYPE_A);

    a = 3.4;
    cout << a + 5.4 << endl;
    float b = a;
    cout << a << endl;
    cout << b << endl;
    cout << a.getType() << endl;
    return 0;
}

I can't guarantee its portability though.

Lie Ryan
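
The struct above relies on reinterpret_cast and on bit-field layout. For comparison, here is a sketch of the same idea that uses only memcpy, shifts and masks, in line with the first answer. The class name TaggedFloat and its members are hypothetical, and the layout is a different choice from the struct above: the 2-bit tag sits in the low bits of the word so the float bits never need shifting. It still assumes a 32-bit IEEE-754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Hypothetical wrapper: upper 30 bits hold the truncated float,
// lower 2 bits hold a small tag.
class TaggedFloat {
public:
    explicit TaggedFloat(unsigned tag = 0) : word_(tag & 3u) {
        assert(tag < 4);  // only 2 bits are available for the tag
    }

    // Store a float, discarding its two lowest mantissa bits.
    void set(float value) {
        std::uint32_t bits;
        std::memcpy(&bits, &value, sizeof bits);
        word_ = (bits & ~std::uint32_t(3)) | (word_ & 3u);
    }

    // Recover the float with the two lowest mantissa bits cleared.
    float get() const {
        const std::uint32_t bits = word_ & ~std::uint32_t(3);
        float value;
        std::memcpy(&value, &bits, sizeof value);
        return value;
    }

    unsigned tag() const { return word_ & 3u; }

    void set_tag(unsigned tag) {
        assert(tag < 4);
        word_ = (word_ & ~std::uint32_t(3)) | (tag & 3u);
    }

private:
    std::uint32_t word_;
};
```

Making the conversions explicit (set/get rather than operator float and operator=) is a design choice here, so that a tagged value cannot silently mix with plain floats or integers.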
"I can't guarantee its portability though" -- Of course not. You have undefined behaviour mixed with implementation-specific stuff (apart from the assumption that float uses an IEEE-754 coding) all over the place.

I can't select any of the answers as the definite one, because most of them have valid information, but not quite what I was looking for. So I'll just summarize my conclusions.

The conversion method I posted in part 1) of my question is clearly wrong according to the C++ standard, so other methods of extracting a float's bits should be used.

And most important... as far as I understand from reading the responses and other sources about IEEE754 floats, it's OK to drop the least significant bits from the mantissa. It will mostly affect only precision, with one exception: sNaN. Since an sNaN is represented by an exponent of 255 and a mantissa != 0, there can be a situation where the mantissa is <= 3, and dropping the last two bits would convert the sNaN to +/-Infinity. But since sNaNs are not generated during floating-point operations on the CPU, it's safe in a controlled environment.
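
If the NaN case ever does matter, one possible workaround is to check for it while packing. The following is only a sketch under the same assumptions as above (32-bit IEEE-754 float, unsigned 32-bit integer without padding); the function names are made up, and the extra test costs a branch that the plain shift does not have.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Pack a raw 32-bit float pattern into 30 bits, keeping NaNs as NaNs.
std::uint32_t pack_bits_keep_nan(std::uint32_t bits) {
    std::uint32_t packed = bits >> 2;
    const bool is_nan = (bits & 0x7F800000u) == 0x7F800000u &&
                        (bits & 0x007FFFFFu) != 0;
    // After the shift the mantissa occupies the low 21 bits. If they are
    // all zero, the value would unpack as infinity, so set the lowest one.
    if (is_nan && (packed & 0x001FFFFFu) == 0) {
        packed |= 1u;
    }
    return packed;
}

// Convenience wrapper taking a float.
std::uint32_t pack_float_keep_nan(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return pack_bits_keep_nan(bits);
}
```

The NaN payload is altered in that corner case (0x7F800001 unpacks as 0x7F800004), but the value stays a NaN and the most significant mantissa bit is left untouched. All other inputs, including infinities, are packed exactly as with the plain shift.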
