I have used unions earlier comfortably; today I was alarmed when I read this post and came to know that this code

union ARGB
{
    uint32_t colour;

    struct componentsTag
    {
        uint8_t b;
        uint8_t g;
        uint8_t r;
        uint8_t a;
    } components;
} pixel;

pixel.colour = 0xff040201;
/* ---- somewhere down the code ---- */
if(pixel.components.a)

is actually undefined behaviour, i.e. reading from a member of the union other than the one most recently written to leads to undefined behaviour. If this isn't the intended use of unions, what is? Can someone please explain it in detail?

+2  A: 

As you say, this is strictly undefined behaviour, though it will "work" on many platforms. The real reason for using unions is to create variant records.

union A {
   int i;
   double d;
};

A a[10];    // records in "a" can be either ints or doubles 
a[0].i = 42;
a[1].d = 1.23;

Of course, you also need some sort of discriminator to say what the variant actually contains. And note that in C++ unions are not much use because they can only contain POD types - effectively those without constructors and destructors.

anon
Have you used it thus (like in the question)?? :)
legends2k
It's a bit pedantic, but I don't quite accept "variant records". That is, I'm sure they were in mind, but if they were a priority why not provide them? "Provide the building block because it might be useful to build other things as well" just seems intuitively more likely. Especially given at least one more application that was probably in mind - memory mapped I/O registers, where the input and output registers (while overlapped) are distinct entities with their own names, types etc.
Steve314
@Steve314 If that was the use they had in mind, they could have made it not be undefined behaviour.
anon
@Neil: +1 for the first to say about the actual usage without hitting undefined behaviour. I guess they could have made it implementation defined like other type punning operations (reinterpret_cast, etc.). But like I asked, have you used it for type-punning?
legends2k
@Neil - the memory-mapped register example isn't undefined, the usual endian/etc aside and given a "volatile" flag. Writing to an address in this model doesn't reference the same register as reading the same address. Therefore there is no "what are you reading back" issue as you're not reading back - whatever output you wrote to that address, when you read you're just reading an independent input. The only issue is making sure you read the input side of the union and write the output side. Was common in embedded stuff - probably still is.
Steve314
@legends2k I don't use it because it doesn't really work in C++ for the reason I gave, and because I think it's normally bad design to use variants of any sort.
anon
+1  A: 

Although this is strictly undefined behaviour, in practice it will work with pretty much any compiler. It is such a widely used paradigm that any self-respecting compiler will need to do "the right thing" in cases such as this. It's certainly to be preferred over type-punning, which may well generate broken code with some compilers.

Paul R
Isn't there an endian issue? A relatively easy fix compared with "undefined", but worth taking into account for some projects if so.
Steve314
+4  A: 

If this isn't the intended behaviour of unions, what is? Can some one please explain it elaborately?

“Undefined” does not necessarily mean unintended. The behaviour is undefined because hardware and software architectures differ and C tries to cater to all of them (at least in theory). For any given platform, however, we can still make sure that the code works as expected, even though it is formally undefined because it is not portable.

On the other hand, unions may have been developed merely as a means to save memory by aliasing memory addresses and reusing them conveniently. For what it’s worth, I’m not convinced by that explanation. ;-)

Konrad Rudolph
The major problem with that line of reasoning is that optimizer writers like to (ab)use undefined behaviour to help simplify the generated code. So even if there is a natural implementation for your target, don't rely on it: you won't get it if dropping it helps win a benchmark.
AProgrammer
+6  A: 

You could use unions to create structs like the following, which contains a field that tells us which component of the union is actually used:

struct VAROBJECT
{
    enum o_t { Int, Double, String } objectType;

    union
    {
        int intValue;
        double dblValue;
        char *strValue;
    } value;
} object;
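
A minimal sketch of how such a tagged struct might be used. The helper names `makeInt`, `makeString` and `describe` are hypothetical and not part of the answer, and `strValue` is declared `const char *` here only so that string literals can be stored. The point is that every read goes through the member named by `objectType`, so no inactive member is ever read.

```cpp
#include <string>

struct VAROBJECT
{
    enum o_t { Int, Double, String } objectType;

    union
    {
        int intValue;
        double dblValue;
        const char *strValue;   // assumption: const so literals can be stored
    } value;
};

// Hypothetical constructors: set the tag and the matching member together.
VAROBJECT makeInt(int v)
{
    VAROBJECT o{};
    o.objectType = VAROBJECT::Int;
    o.value.intValue = v;
    return o;
}

VAROBJECT makeString(const char *s)
{
    VAROBJECT o{};
    o.objectType = VAROBJECT::String;
    o.value.strValue = s;
    return o;
}

// Reads only the member that the tag says is active.
std::string describe(const VAROBJECT &o)
{
    switch (o.objectType)
    {
        case VAROBJECT::Int:    return "int: " + std::to_string(o.value.intValue);
        case VAROBJECT::Double: return "double: " + std::to_string(o.value.dblValue);
        case VAROBJECT::String: return std::string("string: ") + o.value.strValue;
    }
    return "unknown";
}
```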
ammoQ
I totally agree; without entering the undefined-behaviour chaos, this is perhaps the best intended use of unions I can think of. But won't it waste space when I'm just using, say, `int` or `char*` for 10 items of object[]? In that case I could declare separate structs for each data type instead of VAROBJECT. Wouldn't that reduce clutter and use less space?
legends2k
legends: In some cases, you simply can't do that. You use something like VAROBJECT in C in the same cases when you use Object in Java.
ammoQ
@ammoQ: Your code along with AndreyT's explanation gives the right examples for the actual purpose of unions; hence I've reselected your answer.
legends2k
+16  A: 

The behavior is undefined from the language point of view. Consider that different platforms can have different constraints in memory alignment and endianness. The code in a big endian versus a little endian machine will update the values in the struct differently. Fixing the behavior in the language would require all implementations to use the same endianness (and memory alignment constraints...) limiting use.

If you are using C++ (you are using two tags) and you really care about portability, then you can just use the struct and provide a setter that takes the uint32_t and sets the fields appropriately through bitmask operations. The same can be done in C with a function.
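
A minimal sketch of that setter approach, assuming the packed value is laid out as 0xAARRGGBB (as in the question's `0xff040201`). The type name `Pixel` and the member functions `set` and `get` are hypothetical names chosen for illustration.

```cpp
#include <cstdint>

// Hypothetical Pixel type: components are stored separately and the packed
// 0xAARRGGBB value is converted with shifts and masks, so the result does not
// depend on the endianness of the machine.
struct Pixel
{
    std::uint8_t b;
    std::uint8_t g;
    std::uint8_t r;
    std::uint8_t a;

    // Split a packed 0xAARRGGBB value into its components.
    void set(std::uint32_t argb)
    {
        b = static_cast<std::uint8_t>(argb & 0xFFu);
        g = static_cast<std::uint8_t>((argb >> 8) & 0xFFu);
        r = static_cast<std::uint8_t>((argb >> 16) & 0xFFu);
        a = static_cast<std::uint8_t>((argb >> 24) & 0xFFu);
    }

    // Reassemble the packed value from the components.
    std::uint32_t get() const
    {
        return (static_cast<std::uint32_t>(a) << 24) |
               (static_cast<std::uint32_t>(r) << 16) |
               (static_cast<std::uint32_t>(g) << 8) |
               static_cast<std::uint32_t>(b);
    }
};
```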

Edit: I was expecting AProgrammer to write down an answer to vote and close this one. As some comments have pointed out, endianness is dealt with in other parts of the standard by letting each implementation decide what to do, and alignment and padding can also be handled differently. Now, the strict aliasing rules that AProgrammer implicitly refers to are an important point here. The compiler is allowed to make assumptions about the modification (or lack of modification) of variables. In the case of the union, the compiler could reorder instructions and move the read of each colour component before the write to the colour variable.

David Rodríguez - dribeas
+1 for the clear and simple reply! I agree, for portability, the method you've given in the 2nd para holds good; but can I use the way I've put up in the question if my code is tied down to a single architecture (paying the price of portability), since it saves 4 bytes for each pixel value and some time otherwise spent running that function?
legends2k
The endian issue doesn't force the standard to declare it as undefined behaviour - reinterpret_cast has exactly the same endian issues, but has implementation defined behaviour.
Joe Gauterin
@legends2k, the problem is that the optimizer may assume that a uint32_t is not modified by writing to a uint8_t, and so you get the wrong value when the optimizer uses that assumption... @Joe, the undefined behaviour appears as soon as you access the pointer (I know, there are some exceptions).
AProgrammer
@AProgrammer: So without hitting undefined behaviour (unions, reinterpret_cast, type-punning...) I cannot do any bit-level manipulation on an embedded machine, is that it? On my platform memory is at a premium and I cannot afford to allocate something like 8 bytes for a 32-bit pixel colour value.
legends2k
@legends2k, there are some exceptions. I seem to remember that a cast (reinterpret_cast in C++) to a char type is one of them. I don't remember whether uint8_t is guaranteed to be a char type or not. I don't remember a similar exception for unions (there is an exception for unions of structs starting with members of the same types, but that is quite different). Depending on your context, you can check whether masking and shifting the uint32_t, wrapped in some (inline) accessor members, isn't all you need.
AProgrammer
@legends2k/AProgrammer: The result of a reinterpret_cast is implementation defined. Using the pointer returned does not result in undefined behaviour, only in implementation defined behaviour. In other words, the behaviour must be consistent and defined, but it isn't portable.
Joe Gauterin
@legends2k: any decent optimizer will recognize bitwise operations that select an entire byte and generate code to read/write the byte, same as the union but well-defined (and portable). e.g. uint8_t getRed() const { return colour } void setRed(uint8_t r) { colour = (colour }
Ben Voigt
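
A minimal sketch of mask-and-shift accessors over a single `uint32_t`, assuming red occupies bits 16 to 23 of a 0xAARRGGBB value (which matches the question's struct on a little-endian machine). The class name `Colour`, the constant `RedShift` and the `packed` accessor are hypothetical; only `getRed`, `setRed` and `colour` come from the comment.

```cpp
#include <cstdint>

class Colour
{
public:
    explicit Colour(std::uint32_t argb = 0) : colour(argb) {}

    // Select the red byte with a shift and a mask.
    std::uint8_t getRed() const
    {
        return static_cast<std::uint8_t>((colour >> RedShift) & 0xFFu);
    }

    // Clear the red byte, then OR in the new value.
    void setRed(std::uint8_t r)
    {
        colour = (colour & ~(std::uint32_t{0xFF} << RedShift)) |
                 (static_cast<std::uint32_t>(r) << RedShift);
    }

    std::uint32_t packed() const { return colour; }

private:
    static constexpr unsigned RedShift = 16;  // assumption: 0xAARRGGBB layout
    std::uint32_t colour;
};
```

The storage is still a single 32-bit value, so this costs no more memory than the union.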
What Ben Voigt said, with the addition that you can mark those functions as `inline` (or define them as macros), which should allow the optimiser to produce similar or identical code to the `union` construct.
caf
@AProgrammer/Ben Voigt/caf: Lesson learnt; I'll avoid this type of usage with unions altogether and resort to masking and shifting, it's portable by all means :)
legends2k
+1  A: 

You can use a union for two main reasons:

  1. A handy way to access the same data in different ways, like in your example
  2. A way to save space when there are different data members of which only one can ever be 'active'

1. is really more of a C-style hack to short-cut writing code on the basis that you know how the target system's memory architecture works. As already said, you can normally get away with it if you don't actually target lots of different platforms. I believe some compilers might also let you use packing directives (I know they do on structs)?

A good example of 2. can be found in the VARIANT type used extensively in COM.

John
+1 for giving the practical COM example!
legends2k
+1  A: 

Technically it's undefined, but in reality most (all?) compilers treat it exactly the same as using a reinterpret_cast from one type to the other, the result of which is implementation defined. I wouldn't lose sleep over your current code.

Joe Gauterin
+2  A: 

In C it was a nice way to implement something like a variant.

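enum possibleTypes {
  eInt,
  eDouble,
  eChar
};

struct Value {
    union {                  /* no tag: "union Value" would clash with the struct tag */
      int iVal_;
      double dVal_;
      char cVal_;
    } value_;
    enum possibleTypes discriminator_;  /* says which member of value_ is valid */
};

/* given: struct Value val; */
switch (val.discriminator_)
{
  case eInt:    /* use val.value_.iVal_ */ break;
  case eDouble: /* use val.value_.dVal_ */ break;
  case eChar:   /* use val.value_.cVal_ */ break;
}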

In times of little memory, this structure uses less memory than a struct that has all the members.

By the way C provides

    typedef struct {
      unsigned int mantissa_low:32;      //mantissa
      unsigned int mantissa_high:20;
      unsigned int exponent:11;         //exponent
      unsigned int sign:1;
    } realVal;

to access bit values.

Totonga
Both your examples are perfectly well defined by the standard; but, hey, using bit fields is a sure-shot way to write unportable code, isn't it?
legends2k
No, it isn't. As far as I know it's widely supported.
Totonga
+3  A: 

In C++, Boost.Variant implements a safe version of the union, designed to prevent undefined behaviour as much as possible.

Its performance is identical to the enum + union construct (stack allocated too, etc.) but it uses a template list of types instead of the enum :)
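
A minimal sketch of the same idea. It uses `std::variant` from the C++17 standard library rather than Boost.Variant itself, purely so that the example is self-contained; the alias `Value` and the helper `typeName` are hypothetical names.

```cpp
#include <string>
#include <variant>

// The list of template arguments plays the role of the enum + union pair.
using Value = std::variant<int, double, std::string>;

// Report which alternative is currently stored.
std::string typeName(const Value &v)
{
    if (std::holds_alternative<int>(v))    return "int";
    if (std::holds_alternative<double>(v)) return "double";
    return "string";
}
```

Asking for the wrong alternative with `std::get<int>(v)` throws `std::bad_variant_access` instead of silently reinterpreting the stored bytes.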

Matthieu M.
+1  A: 

Others have mentioned the architecture differences (little - big endian).

I read the problem as this: since the memory for the variables is shared, writing to one changes the others and, depending on their type, the resulting value could be meaningless.

e.g. union { float f; int i; } x;

Writing to x.i would be meaningless if you then read from x.f - unless that is what you intended in order to look at the sign, exponent or mantissa components of the float.
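
If the goal really is to look at the sign, exponent and mantissa, one well-defined way is to copy the bytes with `memcpy` instead of reading the other union member. A minimal sketch, assuming a 32-bit IEEE 754 `float`; the names `FloatParts` and `splitFloat` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

struct FloatParts
{
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits
};

FloatParts splitFloat(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");

    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the object representation

    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```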

I think there is also an issue of alignment: If some variables must be word aligned then you might not get the expected result.

e.g. union { char c[4]; int i; } x;

If on some machine a char had to be word aligned then c[0] and c[1] would share storage with i but not c[2] and c[3].

philcolbourn
+1  A: 

For one more example of the actual use of unions, the CORBA framework serializes objects using the tagged union approach. All user-defined classes are members of one (huge) union, and an integer identifier tells the demarshaller how to interpret the union.

Cubbi
+3  A: 

The purpose of unions is rather obvious, but for some reason people miss it quite often.

The purpose of union is to save memory by using the same memory region for storing different objects at different times. That's it.

It is like a room in a hotel. Different people use it for non-overlapping periods of time. These people never meet, and generally don't need to know anything about each other.

That's exactly what union does. If you know that several objects in your program hold values with non-overlapping value-lifetimes, then you can "merge" these objects into a union and thus save memory. Just like a hotel room has at most one "active" tenant at each moment of time, a union has at most one "active" member at each moment of program time. Only the "active" member can be read. By writing into other member you switch the "active" status to that other member.
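
A minimal sketch of that "one tenant at a time" usage. The union `Scratch` and the function `phases` are hypothetical names; each member is only read while it is the active one.

```cpp
// Two values whose lifetimes never overlap share the same storage.
union Scratch
{
    int count;     // needed only in the first phase
    double ratio;  // needed only in the second phase
};

double phases()
{
    Scratch s;

    s.count = 3;        // count becomes the active member
    int n = s.count;    // fine: reading the active member

    s.ratio = n / 2.0;  // ratio is now active; count must not be read any more
    return s.ratio;     // fine: reading the active member
}
```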

For some reason, this original purpose of the union got "overridden" with something completely different: writing one member of a union and then inspecting it through another member. This kind of memory reinterpretation is not a valid use of unions. It generally leads to undefined behaviour.

AndreyT
+1 for being elaborate, giving a simple practical example and saying about the legacy of unions!
legends2k