In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) is defined to be 1 by the standard. But in C, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
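For example, this minimal program (assuming a typical platform where char is 1 byte and int is 4) prints different results depending on whether it is compiled as C or C++:

#include <stdio.h>

int main(void)
{
    /* Compiled as C:   sizeof 'a' == sizeof(int), typically 4
       Compiled as C++: sizeof 'a' == sizeof(char) == 1       */
    printf("%zu %zu\n", sizeof 'a', sizeof(char));
    return 0;
}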

+3  A: 

I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since every char value is a valid byte in a file or input stream, EOF cannot be any char value. What the code did was read each character into an int, test for EOF, and convert it to a char if it wasn't.

I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.

int r;                          /* int, not char, so it can hold EOF */
char buffer[1024], *p = buffer; /* file is an open FILE * */

while ((r = getc(file)) != EOF && p < buffer + sizeof buffer)
{
  *(p++) = (char) r;
}
Kyle Cronin
I don't think 0 is a valid character though.
gbjbaanb
@gbjbaanb: Sure it is. It's the null character. Think about it. Do you think a file shouldn't be allowed to contain any zero bytes?
P Daddy
A null-terminated file might make sense for textual data, but if it's binary I think \0 should be considered a valid value.
Kyle Cronin
Read wikipedia - "The actual value of EOF is a system-dependent negative number, commonly -1, which is guaranteed to be unequal to any valid character code."
Malx
As Malx says - EOF is not a char type - it's an int type. getchar() and friends return an int, which can hold any char as well as EOF without conflict. This would really not require literal chars to have type int.
Michael Burr
+2  A: 

I don't know, but I'm going to guess it was easier to implement that way and it didn't really matter. It wasn't until C++, where the type could determine which overloaded function would be called, that it needed to be fixed.

FigBug
+1  A: 

Indeed, I didn't know this. Before prototypes existed, anything narrower than an int was converted to an int when used as a function argument. That may be part of the explanation.
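A minimal sketch of those default argument promotions, using an old-style (K&R) function definition; the function name show is made up for the example, and modern compilers may warn about the old syntax:

#include <stdio.h>

/* Old-style (K&R) definition: no prototype is introduced, so the
   caller applies the default argument promotions: char and short
   become int, float becomes double. */
void show(c)
int c;                  /* the promoted value arrives as an int */
{
    printf("%d\n", c);
}

int main(void)
{
    char ch = 'a';
    show(ch);           /* ch is promoted to int before the call */
    return 0;
}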

Blaisorblade
+5  A: 

using gcc on my MacBook, I try:

#include <stdio.h>
/* sizeof yields a size_t, so use %zu (C99) rather than %i */
#define test(A) do { printf(#A ":\t%zu\n", sizeof(A)); } while (0)

int main(void)
{
  test('a');
  test("a");
  test("");
  test(char);
  test(short);
  test(int);
  test(long);
  test((char)0x0);
  test((short)0x0);
  test((int)0x0);
  test((long)0x0);
  return 0;
}

which when run gives:

'a':    4
"a":    2
"":     1
char:   1
short:  2
int:    4
long:   4
(char)0x0:      1
(short)0x0:     2
(int)0x0:       4
(long)0x0:      4

which suggests that a character is 8 bits, like you suspect, but a character literal is an int.

dmckee
+1 for being interesting. People often think that sizeof("a") and sizeof("") are char*'s and should give 4 (or 8). But in fact they're char[]'s at that point (sizeof(char[11]) gives 11). A trap for newbies.
paxdiablo
A character literal is not promoted to an int, it is already an int. There is no promotion going on whatsoever if the object is an operand of the sizeof operator. If there was, this would defeat sizeof's purpose.
Chris Young
@Chris Young: Ya. Check. Thanks.
dmckee
+10  A: 

A discussion on the same subject:

"More specifically the integral promotions. In K&R C it was virtually (?) impossible to use a character value without it being promoted to int first, so making character constant int in the first place eliminated that step. There were and still are multi character constants such as 'abcd' or however many will fit in an int."

Malx
Multi-character constants are not portable, even between compilers on a single machine (though GCC seems to be self-consistent across platforms). See: http://stackoverflow.com/questions/328215/
Jonathan Leffler
+4  A: 

I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:

void print(int);
void print(char);

int main()
{
    print('a');  // calls print(char); if 'a' had type int, print(int) would be selected
}

You would expect the call to print to select the second version, which takes a char. If a character literal were an int, that would be impossible. Note that in C++, literals with more than one character still have type int, although their value is implementation-defined. So 'ab' has type int, while 'a' has type char.

Johannes Schaub - litb
Yes, "Design and Evolution of C++" says overloaded input/output routines were the main reason C++ changed the rules.
Max Lybbert
Max, yeah, I cheated. I looked in the standard, in the compatibility section :)
Johannes Schaub - litb
+2  A: 

This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).

EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start out with type int. It is initially of type char, but when it is used in an expression it is promoted to an int. The following is quoted from the book:

Character literals have type int and they get there by following the rules for promotion from type char. This is too briefly covered in K&R 1, on page 39 where it says:

Every char in an expression is converted into an int....Notice that all float's in an expression are converted to double....Since a function argument is an expression, type conversions also take place when arguments are passed to functions: in particular, char and short become int, float becomes double.
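A small sketch that separates the two claims, assuming a platform where int is 4 bytes: a char object is promoted when used in an expression, while 'a' is an int to begin with:

#include <stdio.h>

int main(void)
{
    char c = 'a';
    /* sizeof c   -> 1: the object itself is a char
       sizeof(+c) -> 4: unary + applies the integral promotions
       sizeof 'a' -> 4: the literal already has type int in C   */
    printf("%zu %zu %zu\n", sizeof c, sizeof(+c), sizeof 'a');
    return 0;
}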

PolyThinker
If the other comments are to be believed, the expression 'a' *starts out* with type int -- no type promotion is performed inside of a sizeof(). That 'a' has type int is just a quirk of C it seems.
j_random_hacker
A char literal *does* have type int. The ANSI/ISO 99 standard calls them 'integer character constants' (to differentiate them from 'wide character constants', which have type wchar_t) and specifically says, "An integer character constant has type int."
Michael Burr
What I meant was that it does not *start with* type int, but rather converted to an int from char (answer edited). Of course, this probably does not concern anyone except compiler writers since the conversion is always done.
PolyThinker
No! If you *read the ANSI/ISO 99 C standard* you will find that in C, the expression 'a' *starts with* type int. If you have a function void f(int) and a variable char c, then f(c) *will* perform integral promotion, but f('a') won't as the type of 'a' is *already* int. Strange but true.
j_random_hacker
+2  A: 

I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from The Design and Evolution of C++, 11.2.1 Fine-Grain Resolution):

In C, the type of a character literal such as 'a' is int. Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems. Except for the pathological example sizeof('a'), every construct that can be expressed in both C and C++ gives the same result.

So for the most part, it should cause no problems.

Michael Burr
Interesting! Kinda contradicts what others were saying about how the C standards committee "wisely" decided not to remove this quirk from C.
j_random_hacker
A: 

This is only tangential to the language spec, but in hardware the CPU usually has only one register size (32 bits, say), so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int as the value is loaded into a register. The compiler takes care of properly masking and shifting the number after each operation, so that if you add, say, 2 to (unsigned char) 254, it wraps around to 0 instead of 256; but inside the silicon it is really an int until you store it back to memory.
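A minimal example of that masking behavior, assuming an 8-bit unsigned char and a 32-bit int:

#include <stdio.h>

int main(void)
{
    unsigned char c = 254;
    /* Both operands are promoted to int and added as ints
       (254 + 2 == 256), then the result is truncated back to
       unsigned char on assignment: 256 % 256 == 0. */
    c = c + 2;
    printf("%d\n", c);  /* prints 0 */
    return 0;
}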

It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.

(x86 wonks may note that there is, e.g., a native 16-bit add such as add ax, bx that works on the short-wide registers in one instruction, but inside a RISC-style core this translates to two steps: add the numbers, then extend the sign, like an add/extsh pair on the PowerPC.)

Crashworks