According to the C++0x working draft, the new character types (char16_t and char32_t) for handling Unicode will be unsigned (uint_least16_t and uint_least32_t will be their underlying types).

But as far as I can see (perhaps not very far), a char8_t type (based on uint_least8_t) is not defined. Why?

And it's even more confusing when you see that a new u8 encoding prefix is introduced for UTF-8 string literals... based on our old friend, (signed/unsigned) char. Why?
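For illustration, a minimal sketch of the types and prefixes in question (assuming a C++0x-capable compiler):

```cpp
// Sketch only: the new unsigned character types vs. the u8 prefix,
// whose element type is still plain (possibly signed) char.
const char16_t* s16 = u"caf\u00e9";   // UTF-16 literal, char16_t (unsigned)
const char32_t* s32 = U"caf\u00e9";   // UTF-32 literal, char32_t (unsigned)
const char*     s8  = u8"caf\u00e9";  // UTF-8 literal, element type is plain char
```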

+1  A: 

char16_t and char32_t are supposed to be usable for representing code points. Since there are no negative code points, it's sensible for these to be unsigned.

UTF-8 does not represent code points directly, so it doesn't matter whether u8's underlying type is signed or not.
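A minimal sketch of that distinction (assuming a C++0x-capable compiler and 8-bit chars):

```cpp
// A code point fits in the unsigned char32_t; the same character in UTF-8 is
// just two code units whose byte values can be recovered via unsigned char,
// whatever the signedness of plain char.
char32_t cp = U'\u00e9';        // U+00E9 as a single code point
const char* utf8 = u8"\u00e9";  // the same character encoded as 0xC3 0xA9
unsigned char b0 = static_cast<unsigned char>(utf8[0]);  // 0xC3
unsigned char b1 = static_cast<unsigned char>(utf8[1]);  // 0xA9
```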

Chris Jester-Young
If I want to store the character é (U+00E9), that is the two-byte sequence 0xC3 0xA9, in an array of signed char, it will fail: `signed char e_acute[] = {0xC3, 0xA9};` truncates the values. So if your system defines char as signed char, it's still a problem. Am I wrong?
anno
Very seldom do you need to enter the bytes manually: often, like you say, the `u8` prefix is used. So, high bytes just get treated as negative numbers in that case.
Chris Jester-Young
Chris, is there a guarantee that the pair of conversions `unsigned char -> signed char -> unsigned char` will yield the original value? The former conversion is implementation-defined and I couldn't find any clause that would guarantee the roundtrip.
avakar
@avakar: I'm not sure why the roundtripping is important in this case (unless I misread your comment). The way I understand the task is this: you need a way to convert a bunch of `char` into a bunch of `char16_t` or `char32_t`. You could easily widen a `char` during this conversion.
Chris Jester-Young
My point is that if you're receiving UTF-8 data from somewhere (as a sequence of numbers in the range 0--255, which is how UTF-8 is defined), you can't reliably store them in a char array, because the value you'd obtain by casting back to `unsigned char` could be different (and I'm not even sure if `CHAR_BIT` is guaranteed to be at least 8). For reliability, you have to use `uint_least8_t`, and to me it seems useful and consistent to provide a `char8_t` typedef for it.
avakar
Nah, you never interpret UTF-8 directly. You pass it to a runtime support function that converts it to a native character type, like wchar_t. So it doesn't matter what kind of bag of bytes you put it in.
Hans Passant
Reading a UTF-8 file into a signed char buffer will produce the same problem. Also, if your char is signed, you can't assume that a std::string (basic_string<char>) is a valid UTF-8 string. I don't see how this changes even with u8?
anno
@avakar: Normally, you read in byte data from a file or a network. Those will usually be stored as `char` already, in whatever signedness is native to the system. So in a signed case (in the OP's example), 0xC3, 0xA9 is read in as -0x3D, -0x57 (on two's complement systems). That's fine: the conversion functions can still meaningfully promote that into an int, and process them into actual code points that way.
Chris Jester-Young
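A sketch of the approach described in these comments (hypothetical helper, assuming CHAR_BIT == 8 and a valid two-byte UTF-8 sequence):

```cpp
#include <cstdint>

// Bytes read from a file or socket typically land in a (possibly signed) char
// buffer; converting through unsigned char recovers the original byte values
// before the usual UTF-8 bit manipulation.
std::uint32_t decode_two_byte_utf8(const char* p) {
    unsigned b0 = static_cast<unsigned char>(p[0]);  // e.g. 0xC3, stored as -0x3D if char is signed
    unsigned b1 = static_cast<unsigned char>(p[1]);  // e.g. 0xA9, stored as -0x57
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);       // yields 0xE9 for é (U+00E9)
}
```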
A: 

char will be the type used for UTF-8 because its definition was changed to make sure it can hold UTF-8 code units:

For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be both **at least the size necessary to store an eight-bit coding of UTF-8** and large enough to contain any member of the compiler's basic execution character set. It was previously defined as only the latter. There are three Unicode encodings that C++0x will support: UTF-8, UTF-16, and UTF-32. In addition to the previously noted changes to the definition of char, C++0x will add two new character types: char16_t and char32_t. These are designed to store UTF-16 and UTF-32 respectively.

Source: http://en.wikipedia.org/wiki/C%2B%2B0x

Most UTF-8 applications use plain char already anyway on PC/Mac.
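For example (a sketch, assuming a C++0x-capable compiler):

```cpp
#include <string>

// UTF-8 text carried in plain char / std::string, as is common in practice.
std::string s = u8"caf\u00e9";  // 5 bytes: 'c' 'a' 'f' 0xC3 0xA9
```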

Klaim
Doesn't say a word about signedness.
anno
Ah you're right. :/
Klaim
Why the awkward phrasing of the bold part? Isn't "eight-bit coding of UTF-8" redundant?
dan04
Well, that's Wikipedia; the wording changes often and can vary greatly in quality. However, I didn't find another source that summarizes those Unicode-related features.
Klaim