I've always wondered why the C++ Standard library instantiated basic_[io]stream and all its variants using the char type instead of the unsigned char type. Because char may be signed, operations like get() can overflow or underflow, which leads to implementation-defined values in the variables involved. Another example is when you want to output a byte, unformatted, to an ostream using its put function.
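
A short illustration of the sort of code I mean (a minimal sketch; the file name is made up):

```cpp
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("data.bin", std::ios::binary); // made-up file name

    // get() returns int; narrowing a value above CHAR_MAX back into a
    // (possibly signed) char gives an implementation-defined result.
    char c = in.get();
    std::cout.put(c);

    // Writing a raw byte held as unsigned char: put() takes char, so a cast
    // is needed, and for values > CHAR_MAX that conversion is again
    // implementation-defined.
    unsigned char byte = 0xA4;
    std::cout.put(static_cast<char>(byte));
}
```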

Any ideas?


Note: I'm still not really convinced, so if you know the definitive answer, you can still post it.

+12  A: 

Possibly I've misunderstood the question, but conversion from unsigned char to char isn't unspecified, it's implementation-defined (4.7/3 in the C++ standard).

The type of a 1-byte character in C++ is "char", not "unsigned char". This gives implementations a bit more freedom to do the best thing on the platform (for example, the standards body may have believed that there exist CPUs where signed byte arithmetic is faster than unsigned byte arithmetic, although that's speculation on my part). Also for compatibility with C. The result of removing this kind of existential uncertainty from C++ is C# ;-)

Given that the "char" type exists, I think it makes sense for the usual streams to use it even though its signedness isn't defined. So maybe your question is answered by the answer to, "why didn't C++ just define char to be unsigned?"

Steve Jessop
I thought implementation-defined was the same as unspecified. I will correct my question and look up the difference. Thanks for telling me :)
Johannes Schaub - litb
Unspecified means the implementation can put any value it likes in there (including picking one randomly each time it happens), and need not document what it does. Implementation-defined means that the implementation must document what value it puts in there.
Steve Jessop
I heard that removing the C heritage from C++ yielded D :)
xtofl
OK, thanks, mate. Anyway, I meant if you were doing `char c = foo.get(); doSomething(c);` and don't care about EOF since you know you are not at the end.
Johannes Schaub - litb
That's then an issue with converting int_type to char. You can probably rely on the implementation to choose int_type such that this conversion is sensible, even if it technically could do something weird.
Steve Jessop
Hang on, I missed something out: when converting to a signed type, if the value is representable in the target type, then the resulting value is unchanged (also 4.7/3). The return from get is defined to be either a character value or eof, so if it's not eof then the conversion is defined.
Steve Jessop
Indeed, the implementation-definedness is only triggered for values > CHAR_MAX. That can happen with `std::istringstream a("\xa4"); char c = a.get();` if CHAR_MAX is 127, for example. The correct way is indeed to use int, but most people will just use char anyway, since they don't know about this (spelled out in the sketch below). -.-
Johannes Schaub - litb
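
To spell out the gotcha from the comments above (a sketch; it assumes CHAR_MAX is 127 and a typical 8-bit char):

```cpp
#include <sstream>
#include <iostream>

int main()
{
    // Risky: 0xA4 (164) does not fit into char if CHAR_MAX is 127, so the
    // result of the conversion is implementation-defined.
    std::istringstream a("\xa4");
    char c = a.get();
    std::cout << static_cast<int>(c) << '\n';              // often prints -92

    // Safer: keep the result as int, check for eof, then narrow deliberately.
    std::istringstream b("\xa4");
    int ch = b.get();
    if (ch != std::char_traits<char>::eof())
    {
        unsigned char byte = static_cast<unsigned char>(ch); // well-defined, 0..UCHAR_MAX
        std::cout << static_cast<int>(byte) << '\n';          // prints 164
    }
}
```
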
+4  A: 

char is for characters, unsigned char for raw bytes of data, and signed char for, well, signed data.

The standard does not specify whether signed or unsigned char will be used for the implementation of char - it is compiler-specific. It only specifies that char will be large enough to hold the characters on your system - characters the way they were in those days, that is, no Unicode.

Using "char" for characters is the standard way to go. Using unsigned char is a hack, although it'll match compiler's implementation of char on most platforms.

n-alexander
"Using "char" for characters is the standard way to go. Using unsigned char is a hack", how is that? Streams are not only for exchanging basic characters, but also for exchanging binary data (after all, that's what `ios_base::binary` is for). Would it use `unsigned char`, we would not have to care about negative char values at all, and always get positive values back. It would seem to be so much nicer.
Johannes Schaub - litb
+3  A: 

I have always understood it this way: the purpose of the iostream class is to read and/or write a stream of characters, which, if you think about it, are abstract entities that are only represented by the computer using a character encoding. The C++ standard takes great pains to avoid pinning down the character encoding, saying only that "Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set", because it doesn't need to pin down the implementation's basic character set in order to define the C++ language; the standard can leave the decision of which character encoding is used to the implementation (the compiler together with an STL implementation), and just note that char objects represent single characters in some encoding.

An implementation writer could choose a single-octet encoding such as ISO-8859-1 or even a double-octet encoding such as UCS-2. It doesn't matter. As long as a char object is "large enough to store any member of the implementation's basic character set" (note that this explicitly forbids variable-length encodings), then the implementation may even choose an encoding that represents basic Latin in a way that is incompatible with any common encoding!

It is confusing that the char, signed char, and unsigned char types share "char" in their names, but it is important to keep in mind that char does not belong to the same family of fundamental types as signed char and unsigned char. signed char is in the family of signed integer types:

There are four signed integer types: "signed char", "short int", "int", and "long int."

and unsigned char is in the family of unsigned integer types:

For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int," ...

The one similarity between the char, signed char, and unsigned char types is that "[they] occupy the same amount of storage and have the same alignment requirements". Thus, you can reinterpret_cast a char object as an unsigned char (through a reference or pointer) to determine the numeric value of the character in the execution character set.
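
For example (a sketch; the cast goes through a reference because reinterpret_cast applies to references and pointers, not to values):

```cpp
#include <iostream>

int main()
{
    char ch = 'A';

    // Reinterpret the same storage as unsigned char to read the character's
    // numeric value in the execution character set.
    unsigned char const& value = reinterpret_cast<unsigned char const&>(ch);
    std::cout << static_cast<int>(value) << '\n';   // 65 on an ASCII-based system
}
```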

To answer your question, the reason why the STL uses char as the default type is that the standard streams are meant for reading and/or writing streams of characters, represented by char objects, not streams of integers (signed char and unsigned char). The use of char versus a numeric value is a way of separating concerns.

Daniel Trebbien