views:

1482

answers:

5

In an Open Source program I wrote, I'm reading binary data (written by another program) from a file and outputting ints, doubles, and other assorted data types. One of the challenges is that it needs to run on 32-bit and 64-bit machines of both endiannesses, which means that I end up having to do quite a bit of low-level bit-twiddling. I know a (very) little bit about type punning and strict aliasing and want to make sure I'm doing things the right way.

Basically, it's easy to convert from a char* to an int of various sizes:

int64_t snativeint64_t(const char *buf) 
{
    /* Interpret the first 8 bytes of buf as a 64-bit int */
    return *(int64_t *) buf;
}

and I have a cast of support functions to swap byte orders as needed, such as:

int64_t swappedint64_t(const int64_t wrongend)
{
    /* Change the endianness of a 64-bit integer */
    return (((wrongend & 0xff00000000000000LL) >> 56) |
            ((wrongend & 0x00ff000000000000LL) >> 40) |
            ((wrongend & 0x0000ff0000000000LL) >> 24) |
            ((wrongend & 0x000000ff00000000LL) >> 8)  |
            ((wrongend & 0x00000000ff000000LL) << 8)  |
            ((wrongend & 0x0000000000ff0000LL) << 24) |
            ((wrongend & 0x000000000000ff00LL) << 40) |
            ((wrongend & 0x00000000000000ffLL) << 56));
}

At runtime, the program detects the endianness of the machine and assigns one of the above to a function pointer:

int64_t (*slittleint64_t)(const char *);
if(littleendian) {
    slittleint64_t = snativeint64_t;
} else {
    slittleint64_t = sswappedint64_t;
}

Now, the tricky part comes when I'm trying to cast a char* to a double. I'd like to re-use the endian-swapping code like so:

union 
{
    double  d;
    int64_t i;
} int64todouble;

int64todouble.i = slittleint64_t(bufoffset);
printf("%lf", int64todouble.d);

However, some compilers could optimize away the "int64todouble.i" assignment and break the program. Is there a safer way to do this, while considering that this program must stay optimized for performance, and also that I'd prefer not to write a parallel set of transformations to cast char* to double directly? If the union method of punning is safe, should I be re-writing my functions like snativeint64_t to use it?


I ended up using onebyonelivejournalcom's answer because the conversion functions re-written to use memcpy, like so:

int64_t snativeint64_t(const char *buf) 
{
    /* Interpret the first 8 bytes of buf as a 64-bit int */
    int64_t output;
    memcpy(&output, buf, 8);
    return output;
}

compiled into the exact same assembler as my original code:

snativeint64_t:
        movq    (%rdi), %rax
        ret

Of the two, the memcpy version more explicitly expresses what I'm trying to do and should work on even the most naive compilers.

Adam, your answer was also wonderful and I learned a lot from it. Thanks for posting!

+12  A: 

I highly suggest you read Understanding Strict Aliasing. Specifically, see the sections labeled "Casting through a union". It has a number of very good examples. While the article is on a website about the Cell processor and uses PPC assembly examples, almost all of it is equally applicable to other architectures, including x86.

Adam Rosenfield
Thanks! That's the sort of thing I was looking for. I'm off to read now.
Just Some Guy
Link gives a 404, the article is here now:http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html
ryan_s
@ryan_s: Thanks, fixed
Adam Rosenfield
+2  A: 

The standard says that writing to one field of a union and reading from it immediately is undefined behaviour. So if you go by the rule book, the union based method won't work.

Macros are usually a bad idea, but this might be an exception to the rule. It should be possible to get template-like behaviour in C using a set of macros using the input and output types as parameters.

Pramod
The GCC manual says that "Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type." It's *so* tempting to call that good enough, but I hate to write compiler-specific code. Got a pointer to a macro example?
Just Some Guy
A: 

As a very small sub-suggestion, I suggest you investigate if you can swap the masking and the shifting, in the 64-bit case. Since the operation is swapping bytes, you should be able to always get away with a mask of just 0xff. This should lead to faster, more compact code, unless the compiler is smart enough to figure that one out itself.

In brief, changing this:

(((wrongend & 0xff00000000000000LL) >> 56)

into this:

((wrongend >> 56) & 0xff)

should generate the same result.

unwind
That would only work for the first mask-and-shift operation since all the others are moving bits into the middle of the output.
Just Some Guy
True, then you'd have to shift it back up after masking. I would probably prefer doing that, since avoiding the huge constants (to me) is nice. Just swapping the order works better when you're extracting the bytes and doing something else with then, byte-for-byte.
unwind
A: 

Edit:
Removed comments regarding how to effectively store data always big endian and swapping to machine endianess, as questioner hasn't mentioned another program writes his data (which is important information).

Still if the data needs conversion from any endian to big and from big to host endian, ntohs/ntohl/htons/htonl are the best methods, most elegant and unbeatable in speed (as they will perform task in hardware if CPU supports that, you can't beat that).


Regarding double/float, just store them to ints by memory casting:

double d = 3.1234;
printf("Double %f\n", d);
int64_t i = *(int64_t *)&d;
// Now i contains the double value as int
double d2 = *(double *)&i;
printf("Double2 %f\n", d2);

Wrap it into a function

int64_t doubleToInt64(double d)
{
    return *(int64_t *)&d;
}

double int64ToDouble(int64_t i)
{
    return *(double *)&i;
}


Questioner provided this link:

http://cocoawithlove.com/2008/04/using-pointers-to-recast-in-c-is-bad.html

as a prove that casting is bad... unfortunately I can only strongly disagree with most of this page. Quotes and comments:

As common as casting through a pointer is, it is actually bad practice and potentially risky code. Casting through a pointer has the potential to create bugs because of type punning.

It is not risky at all and it is also not bad practice. It has only a potential to cause bugs if you do it incorrectly, just like programming in C has the potential to cause bugs if you do it incorrectly, so does any programming in any language. By that argument you must stop programming altogether.

Type punning
A form of pointer aliasing where two pointers and refer to the same location in memory but represent that location as different types. The compiler will treat both "puns" as unrelated pointers. Type punning has the potential to cause dependency problems for any data accessed through both pointers.

This is true, but unfortunately totally unrelated to my code.

What he refers to is code like this:

int64_t * intPointer;
:
// Init intPointer somehow
:
double * doublePointer = (double *)intPointer;

Now doublePointer and intPointer both point to the same memory location, but treating this as the same type. This is the situation you should solve with a union indeed, anything else is pretty bad. Bad that is not what my code does!

My code copies by value, not by reference. I cast a double to int64 pointer (or the other way round) and immediately deference it. Once the functions return, there is no pointer held to anything. There is a int64 and a double and these are totally unrelated to the input parameter of the functions. I never copy any pointer to a pointer of a different type (if you saw this in my code sample, you strongly misread the C code I wrote), I just transfer the value to a variable of different type (in an own memory location). So the definition of type punning does not apply at all, as it says "refer to the same location in memory" and nothing here refers to the same memory location.

int64_t intValue = 12345;
double doubleValue = int64ToDouble(intValue);
// The statement below will not change the value of doubleValue!
// Both are not pointing to the same memory location, both have their
// own storage space on stack and are totally unreleated.
intValue = 5678;

My code is nothing more than a memory copy, just written in C without an external function.

int64_t doubleToInt64(double d)
{
    return *(int64_t *)&d;
}

Could be written as

int64_t doubleToInt64(double d)
{
    int64_t result;
    memcpy(&result, &d, sizeof(d));
    return result;
}

It's nothing more than that, so there is no type punning even in sight anywhere. And this operation is also totally safe, as safe as an operation can be in C. A double is defined to always be 64 Bit (unlike int it does not vary in size, it is fixed at 64 bit), hence it will always fit into a int64_t sized variable.

Mecki
On your first point, the program reads data generated by another program. On the second point, this seems to be frowned on: http://cocoawithlove.com/2008/04/using-pointers-to-recast-in-c-is-bad.html and part of what I'm asking is whether I should be getting away from it altogether.
Just Some Guy
See update above. There is no type punning involved as your linked page claims, not even close to. And unlike your code I also never cast a char pointer to anything (as this ABSOLUTELY unsafe!), I pass all data by value (never by reference!) and I cast only between types guaranteed to have same size
Mecki
Passing by value would be impossible in my code for performance reasons. I never cast a char pointer; I cast its contents. Finally, ntoh* only works on casting big-endian values. There is no corresponding function for little-endian values.
Just Some Guy
+1  A: 

Since you seem to know enough about your implementation to be sure that int64_t and double are the same size, and have suitable storage representations, you might hazard a memcpy. Then you don't even have to think about aliasing.

Since you're using a function pointer for a function that might easily be inlined if you were willing to release multiple binaries, performance must not be a huge issue anyway, but you might like to know that some compilers can be quite fiendish optimising memcpy - for small integer sizes a set of loads and stores can be inlined, and you might even find the variables are optimised away entirely and the compiler does the "copy" simply be reassigning the stack slots it's using for the variables, just like a union.

int64_t i = slittleint64_t(buffoffset);
double d;
memcpy(&d,&i,8); /* might emit no code if you're lucky */
printf("%lf", d);

Examine the resulting code, or just profile it. Chances are even in the worst case it will not be slow.

In general, though, doing anything too clever with byteswapping results in portability issues. There exist ABIs with middle-endian doubles, where each word is little-endian, but the big word comes first.

Normally you could consider storing your doubles using sprintf and sscanf, but for your project the file formats aren't under your control. But if your application is just shovelling IEEE doubles from an input file in one format to an output file in another format (not sure if it is, since I don't know the database formats in question, but if so), then perhaps you can forget about the fact that it's a double, since you aren't using it for arithmetic anyway. Just treat it as an opaque char[8], requiring byteswapping only if the file formats differ.

Steve Jessop
Great memcpy tip - thanks! I do actually need the doubles output in text form or I'd just sling the raw bytes around. Also, I profiled it heavily with and without function pointers (because I was willing to skip big endian if it had much effect) but was there was no measurable difference.
Just Some Guy