Hey SO, This should be easy for all you C hackers out there :D

Anyways, I want to store a 4-byte int in a char array... such that the first 4 locations of the char array are the 4 bytes of the int.

Then, I want to pull the int back out of the array...

Also, bonus points if someone can give me code for doing this in a loop, i.e. writing something like 8 ints into a 32-byte array.

int har = 0x01010101;
char a[4];
int har2;

// write har into char such that:
// a[0] == 0x01, a[1] == 0x01, a[2] == 0x01, a[3] == 0x01 etc.....

// then, pull the bytes out of the array such that:
// har2 == har

Thanks guys!

EDIT: Assume ints are 4 bytes...

EDIT2: Please don't worry about endianness; I will handle that myself. I just want different ways to achieve the above in C/C++. Thanks

EDIT3: If you can't tell, I'm trying to write a serialization class on the low level... so I'm looking for different strategies to serialize some common data types.

+17  A: 

Unless you care about byte order and such, memcpy will do the trick:

memcpy(a, &har, sizeof(har));
...
memcpy(&har2, a, sizeof(har2));

Of course, there's no guarantee that sizeof(int)==4 on any particular implementation (and there are real-world implementations for which this is in fact false).

Writing a loop should be trivial from here.
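
A minimal sketch of such a loop, assuming the native byte order is acceptable (as the question states). The function names `pack_ints` and `unpack_ints` are made up for illustration, and `sizeof(int)` is used instead of a hard-coded 4 so the code does not depend on the size of int:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `count` ints into `buf`, one after another, in native byte order.
   `buf` must hold at least count * sizeof(int) bytes. */
void pack_ints(char *buf, const int *values, size_t count) {
    assert(buf != NULL && values != NULL);
    for (size_t i = 0; i < count; ++i) {
        memcpy(buf + i * sizeof(int), &values[i], sizeof(int));
    }
}

/* Read `count` ints back out of `buf` into `values`. */
void unpack_ints(const char *buf, int *values, size_t count) {
    assert(buf != NULL && values != NULL);
    for (size_t i = 0; i < count; ++i) {
        memcpy(&values[i], buf + i * sizeof(int), sizeof(int));
    }
}
```

With 4-byte ints, calling `pack_ints(buf, values, 8)` fills a 32-byte array, and `unpack_ints(buf, out, 8)` should restore the original values.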

Pavel Minaev
+4  A: 

Don't use unions, Pavel clarifies:

It's U.B., because C++ prohibits accessing any union member other than the last one that was written to. In particular, the compiler is free to optimize away the assignment to int member out completely with the code above, since its value is not subsequently used (it only sees the subsequent read for the char[4] member, and has no obligation to provide any meaningful value there). In practice, g++ in particular is known for pulling such tricks, so this isn't just theory. On the other hand, using static_cast<void*> followed by static_cast<char*> is guaranteed to work.

– Pavel Minaev

GMan
It's U.B., because C++ prohibits accessing any union member other than the last one that was written to. In particular, the compiler is free to optimize away the assignment to `int` member out completely with the code above, since its value is not subsequently used (it only sees the subsequent read for the `char[4]` member, and has no obligation to provide any meaningful value there). In practice, g++ in particular is known for pulling such tricks, so this isn't just theory. On the other hand, using `static_cast<void*>` followed by `static_cast<char*>` is guaranteed to work.
Pavel Minaev
Thought so, I never clarified it, though. If you don't mind, I'll leave your comment as advice.
GMan
I don't mind, but it would be nice to fix those `static_cast`s :)
Pavel Minaev
Fix'd. [15char]
GMan
+5  A: 
#include <stdio.h>

int main(void) {
    typedef union foo {
        int x;
        char a[4];
    } foo;

    foo p;
    p.x = 0x01010101;
    printf("%x ", p.a[0]);
    printf("%x ", p.a[1]);
    printf("%x ", p.a[2]);
    printf("%x ", p.a[3]);

    return 0;
}

Bear in mind that a[0] holds the LSB and a[3] holds the MSB on a little-endian machine; on a big-endian machine the order is reversed.

Ashwin
Your comment about the LSB and MSB only holds true for little endian architectures.
1800 INFORMATION
The read of `p.a` in this code invokes U.B., because it was not preceded by a write to `a`. Any conformant C++ implementation can legally optimize away the assignment to `p.x` completely, and some will do so.
Pavel Minaev
Umm, yes and no. The exact result is U.B., I guess, because it depends on platform architecture, but unions are the one legal way to alias different types and I would be quite surprised at a compiler that didn't totally understand that p.a had been written. In fact, unions are the *only* official way around type aliasing optimization in gnu implementations.
DigitalRoss
That's true, I guess, but unions are not the only way to solve this problem, and there are solutions that do not invoke UB so it is probably best to favour those.
1800 INFORMATION
It is not legal to alias any two arbitrary types (unions or not - just don't do this, period), but it is perfectly legal to alias any POD type via a `char*`, and g++ supports that as well. The only caveat is that to be strictly conformant, you must `static_cast` to `char*` rather than `reinterpret_cast` or C-style cast (which means that you must first `static_cast` to `void*`) - though I haven't seen any implementation where that last bit actually makes any difference...
Pavel Minaev
Actually, just for the sake of completeness - it is legal to alias two POD structs in a union if they have a "common sequence" of fields (i.e. same types in same order) at the beginning, but then you can only alias those common fields...
Pavel Minaev
@DigitalRoss: Please look up what U.B. means. UB by definition does not "depend". If it depends on platform architecture then it is unspecified or implementation-defined, not UB. With UB, all bets are off, and, as Pavel says, the compiler could just optimize it away. I know GCC specifically allows the union trick, but that doesn't make it official. And it isn't the "only" way either.
jalf
+5  A: 
#include <stdio.h>

int main(void) {
    char a[sizeof(int)];
    *((int *) a) = 0x01010101;
    printf("%d\n", *((int *) a));
    return 0;
}

Keep in mind:

A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.

Sinan Ünür
The pointer can be converted, but that doesn't mean that it can be dereferenced. E.g. you can convert `int*` to `float*` (no U.B.), but as soon as you try to write anything via that `float*`, you hit U.B. Your example is fine because writing via `char*` is specifically allowed for PODs, and lifetime of POD starts as soon as memory is allocated for it, but this is worth clarifying.
Pavel Minaev
Actually, sorry, I'm wrong, and this example is still U.B. - specifically, there's no guarantee that `a` is correctly aligned for `int`. There is a guarantee when allocating arrays with `new`, that they will be correctly aligned for any object of the same size as array; but there's no such guarantee for auto or static variables, or member fields. E.g. consider local variable declarations: `char c; char a[4];` - there's a good chance that `a` will not be allocated on a 4-byte boundary, and on some architectures this will result in a crash when you try to write into that location via an `int*`.
Pavel Minaev
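
One way to sidestep the alignment concern raised above is to never form an `int*` into the char buffer and copy the bytes with `memcpy` instead. This is only a sketch; the helper names `write_int_at` and `read_int_at` are hypothetical, and the caller is assumed to guarantee that `offset + sizeof(int)` stays within the buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store `value` at an arbitrary (possibly misaligned) byte offset in `buf`.
   No int pointer into `buf` is ever created, so alignment does not matter. */
void write_int_at(char *buf, size_t offset, int value) {
    assert(buf != NULL);
    memcpy(buf + offset, &value, sizeof value);
}

/* Load an int from an arbitrary byte offset in `buf`. */
int read_int_at(const char *buf, size_t offset) {
    int value;
    assert(buf != NULL);
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

For example, `write_int_at(buf, 1, x)` followed by `read_int_at(buf, 1)` should give back `x` even though offset 1 is very unlikely to be int-aligned.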
Pavel, could you clarify what you mean by POD and U.B.? Thanks
Polaris878
POD = Plain Old Data type and UB = Undefined Behavior.
Sinan Ünür
Accessing any data type using a char pointer is fine. However, assuming data pointed to by a char pointer is correctly aligned for some other data type results in undefined behavior. Anything can happen.
Sinan Ünür
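
To illustrate the direction that is allowed, here is a rough sketch that views an int's bytes through an `unsigned char` pointer rather than viewing a char array through an `int` pointer. The names `int_to_bytes` and `bytes_to_int` are invented for this example, and the byte order is whatever the platform uses:

```c
#include <assert.h>
#include <stddef.h>

/* Copy the object representation of `value` into `out` by reading it
   through an unsigned char pointer. `out` needs sizeof(int) bytes. */
void int_to_bytes(int value, unsigned char *out) {
    const unsigned char *p = (const unsigned char *)&value;
    assert(out != NULL);
    for (size_t i = 0; i < sizeof value; ++i) {
        out[i] = p[i];
    }
}

/* Rebuild an int by writing its bytes through an unsigned char pointer. */
int bytes_to_int(const unsigned char *in) {
    int value = 0;
    unsigned char *p = (unsigned char *)&value;
    assert(in != NULL);
    for (size_t i = 0; i < sizeof value; ++i) {
        p[i] = in[i];
    }
    return value;
}
```

The pointer is always derived from a real `int` object, so it is correctly aligned by construction; only the byte-wise view is a character type.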
POD = Plain Old Data. U.B. = Undefined Behavior. The meanings of those two terms are precisely defined in ISO C++ specification. U.B. basically means "anything at all can happen, with no limits". POD means more or less "one of C++ primitive types like int or float, any pointer type, any enum type, array of any POD type, or any struct/class/union consisting solely of fields of POD types, with no non-public members, no base classes, no explicit ctors or dtors, and no virtual members."
Pavel Minaev
It is safe to assume that pointer is correctly aligned if you allocate memory like this: `char* a = new char[sizeof(int)]`. The resulting block of memory is guaranteed to be aligned properly for any object that can fit into that block - including, obviously, an int. On a side note, it's worth looking at how much trickery `boost::optional` has to do to get the alignment right while avoiding heap allocation: http://www.boost.org/doc/libs/1_39_0/boost/optional/optional.hpp - have a look at `type_with_alignment` template...
Pavel Minaev
Thanks, I know what the terms mean I just wasn't sure on the acronyms :)
Polaris878
Are you sure dereferencing the casted pointer is UB? I'm pretty sure there's something about it just behaving as if it's pointing to 'an object with an unspecified value of type T'. Moreover, the note in 5.3.4:10 specifically mentions that char arrays are max aligned to allow "the common idiom of allocating character arrays into which objects of other types will later be placed".
jalf
@jalf I only have `n1124.pdf` (ISO/IEC 9899:TC2) and there is no section 5.3.4 in that document. I think you are referring to the C++ standard (deduced from http://www.boost.org/doc/libs/1_40_0/libs/pool/doc/implementation/alignment.html ). In any case, no, I am not sure whether the code above invokes UB, although I cannot find anything in the C standard that guarantees that it does not.
Sinan Ünür
+4  A: 

Note: Accessing a union through a member that wasn't the last one assigned to is undefined behavior. Assuming a platform where chars are 8 bits and ints are 4 bytes, a bit mask of 0xFF will mask off one byte, so

char arr[4];
int a = 5;

arr[3] = a & 0xff;
arr[2] = (a & 0xff00) >>8;
arr[1] = (a & 0xff0000) >>16;
arr[0] = (a & 0xff000000)>>24;

would make arr[0] hold the most significant byte and arr[3] hold the least.

Edit: Just so you understand the trick, & is bitwise 'and', whereas && is logical 'and'. Thanks to the commenters who pointed out the forgotten shift.

stonemetal
+1, that's the way to go if a specific binary representation is required (i.e. no LSB/MSB confusion).
Pavel Minaev
Don't forget to shift!
Polaris878
As pointed out by Polaris878, the last 3 assignments will set '0' into the array, since you haven't used ">>" on the values.
Richard Corden
+3  A: 
Does this run into trouble if har is negative? (I seem to remember that there's something odd about bit-shifting and negative ints...)
Michael Anderson
A: 

You can also use placement new for this:

#include <new>  // declares the placement form of operator new

void foo (int i) {
  char * c = new (&i) char[sizeof(i)];
}
Richard Corden
A: 
union value {
   int i;
   char bytes[sizeof(int)];
};

value v;
v.i = 2;

char* bytes = v.bytes;
codie