views: 594
answers: 6

I was reading about unions in C from K&R. As far as I understood, a single union variable can hold any one of several types, and if something is stored as one type and extracted as another, the result is purely implementation-defined.

Now please check this code snippet:

#include <stdio.h>

int main(void) {
    union a {
        int i;
        char ch[2];
    };

    union a u;
    u.ch[0] = 3;
    u.ch[1] = 2;

    printf("%d %d %d\n", u.ch[0], u.ch[1], u.i);

    return 0;
}

Output:

3 2 515

Here I am assigning values to u.ch but retrieving from both u.ch and u.i. Is this implementation-defined, or am I doing something really silly?

I know this may seem very basic to most people, but I am unable to figure out the reason behind that output.

Thanks,

+13  A: 

This is undefined behaviour. u.i and u.ch are located in an overlapping memory region, but may be aligned differently. This depends on the compiler, platform, architecture, and sometimes even the compiler's optimization level. Therefore the output for u.i may not always be 515.

Example

For example gcc on my machine produces two different answers for -O0 and -O2.

  1. Because my machine has a 32-bit little-endian architecture, with -O0 I end up with the two least significant bytes initialized to 3 and 2, while the two most significant bytes contain whatever happened to be on the stack at the time. So the union's memory looks like this: {3, 2, garbage, garbage}

    Hence I get the output similar to 3 2 -1216937469.

  2. With -O2, I get the output of 3 2 515 like you do, which makes union memory {3, 2, 0, 0}. What happens is that gcc optimizes the call to printf with actual values, so the assembly output looks like an equivalent of:

    #include <stdio.h>
    int main() {
        printf("%d %d %d\n", 3, 2, 515);
        return 0;
    }
    

    The value 515 can be obtained as explained in other answers to this question. In essence it means that when gcc optimized the call, it chose zeroes as the value of the would-be uninitialized bytes of the union.

Writing to one union member and reading from another usually does not make much sense, but type punning through a union is sometimes useful, for example as a way around pointer casts that would violate strict aliasing rules.

Alex B
I am almost convinced that the behavior is implementation-defined, but the source of the problem is making me think otherwise. Did you really try the code in a compiler?
nthrgeek
Yes, in fact output for me is -1216937469
Alex B
But 515 with -O2! (gcc)
Alex B
Thanks I got it now [:)]
nthrgeek
OK, in -O2 case gcc passes constants 2, 3 and 515 on the stack to printf, which is what it thinks the union *would* contain (the union is optimized out). That's not the case with -O0, however!
Alex B
+1 for the link.
nthrgeek
To be pedantic, actually it's undefined behavior, not an implementation-defined result. The difference is that "implementation-defined" in the standard means that there must be a result, and the implementation must document what that result is. Undefined means the implementation is allowed to just crash, do something random, or whatever. "whatever" permits "do something sensible, and document what that is". In practice, implementations always do something sensible in this case and document what it is, because it's such a widely-used trick. So it appears implementation-defined.
Steve Jessop
There is an exception in the standard for unions of structs which have a common initial sequence of members, which provides defined behaviour. See 6.5.2.3/5. Also, since pretty much all implementations don't have trap representations for integer types, those are pretty safe. But it is legal for int to have padding bits, in which case assigning to the char array in that union could create a trap representation (or the unassigned bytes could include padding bits). The attempt to print it would then be undefined behaviour.
Steve Jessop
Corrected now. Thanks Steve.
Alex B
+2  A: 

It is implementation-dependent and the results might vary on a different platform/compiler, but it seems this is what is happening:

515 in binary is

1000000011

Padding zeros to make it two bytes (assuming 16 bit int):

0000001000000011

The two bytes are:

00000010 and 00000011

Which is 2 and 3

Hope someone explains why they are reversed - my guess is that the chars are not reversed but the int is little-endian.

The amount of memory allocated to a union is equal to the memory required to store its biggest member. In this case, you have an int and a char array of length 2. Assuming a 16-bit int and an 8-bit char, both require the same space, and hence the union is allocated two bytes.

When you assign three (00000011) and two (00000010) to the char array, the state of the union is 0000001100000010. When you read the int from this union, it converts the whole thing into an integer. Assuming a little-endian representation where the LSB is stored at the lowest address, the int read from the union would be 0000001000000011, which is the binary for 515.

NOTE: This holds true even if the int was 32 bit - Check Amnon's answer

Amarghosh
My processor is little-endian
nthrgeek
That was a mistake - I was referring to little endian though I typed big endian. This is what is happening - even if your int is 32 bit. See the update.
Amarghosh
How will you explain ? `int main(void) { union a{ int i; char ch[3]; }; union a u; u.ch[0] = 3; u.ch[1] = 2; u.ch[2] = 2; printf("%d %d %d\n",u.ch[0],u.ch[1],u.i); return 0; } `
nthrgeek
Did you get 131587?
Amarghosh
or 131842 - if you got one of these, i think i know whats happening - otherwise :(
Amarghosh
No, in my machine the output is exactly the same as before.
nthrgeek
What does `printf("%d %d", sizeof(int), sizeof(char))` print?
Amarghosh
my guess is that your int is 16 bit. I got 131587 as expected on my 32 bit int compiler (that printed 4 for sizeof int)
Amarghosh
Yes I tried this on 16-bit compiler.
nthrgeek
In the question paper it is given to take integer as 2 byte size.
nthrgeek
Then it is clear. The union is allocated the size of char array (3 bytes) which is greater than the size of an int (2 bytes). When you read int from the union, it considers only the first two bytes (and inverts them thanks to little endian processor) and hence the result 515.
Amarghosh
+4  A: 

The output from such code will be dependent on your platform and C compiler implementation. Your output makes me think you're running this code on a little-endian system (probably x86). If you were to put 515 into i and look at it in a debugger, you would see that the lowest-order byte would be a 3 and the next byte in memory would be a 2, which maps exactly to what you put in ch.

If you did this on a big-endian system, you would have (probably) gotten 770 (assuming 16-bit ints) or 50462720 (assuming 32-bit ints).

Ferruccio
+1 for the correct explanation.
nthrgeek
How will you explain this: `#include <stdio.h> int main(void) { union a{ int i; char ch[3]; }; union a u; u.ch[0] = 3; u.ch[1] = 2; u.ch[2] = 2; printf("%d %d %d\n", u.ch[0], u.ch[1], u.i); return 0; }`?
nthrgeek
+6  A: 

The reason behind the output is that on your machine integers are stored in little-endian format: the least-significant bytes are stored first. Hence the byte sequence [3,2,0,0] represents the integer 3+2*256=515.

This result depends on the specific implementation and the platform.

Amnon
I really liked your answer. Thanks,
nthrgeek
Technically undefined, not implementation-defined. The terms have different meanings in the standard.
Steve Jessop
+2  A: 

If you're on a 32-bit system, then an int is 4 bytes but you initialise only 2 of them. Accessing uninitialised data is undefined behaviour.

Assuming you're on a system with 16-bit ints, then what you are doing is still implementation defined. If your system is little endian, then u.ch[0] will correspond with the least significant byte of u.i and u.ch[1] will be the most significant byte. On a big endian system, it's the other way around. Also, the C standard does not force the implementation to use two's complement to represent signed integer values, though two's complement is the most common. Obviously, the size of an integer is also implementation defined.

Hint: it's easier to see what's happening if you use hexadecimal values. On a little endian system, the result in hex would be 0x0203.

notacat
+1  A: 

The correct answer from the point of view of the C language is that, in general, writing one member of a union and reading another is undefined behavior. Not implementation-defined (which is a different thing), but undefined behavior. The relevant portion of the language standard in this case is 6.5/7.

Your specific example, however, has some redeeming properties, since one of the members of your union is a char array. It is legal in C to reinterpret the contents of any object as a char array (again, 6.5/7).

However, the reverse is not true. Writing data into the char array member of your union and then reading it as an int is, again, undefined behavior. The rationale is not only that your char array might be shorter than the int, but also that the int value you handcraft by manipulating char bytes might not be a valid int value.

In other words, you can legally write data into the int member of the union and then read the members of the char array, but not the other way around. So, the correct answer to your question, once again: from the point of view of the C language, your code produces undefined behavior, not implementation-defined behavior.

AndreyT