views:

256

answers:

5

My OS is Debian, my default locale is UTF-8, and my compiler is gcc. By default, CHAR_BIT in limits.h is 8, which is fine for ASCII because in ASCII 1 char = 8 bits. But since I am using UTF-8, characters can take up to 32 bits, which seems to contradict the CHAR_BIT default value of 8.

If I modify CHAR_BIT to 32 in limits.h to better suit UTF-8, what do I have to do for this new value to take effect? I guess I have to recompile gcc? Do I have to recompile the Linux kernel? What about the default installed Debian packages, will they still work?

+1  A: 

UTF-8 encodes one character in one or more bytes (up to four).

Also, do not edit your system header files. And no, modifying CHAR_BIT will not work, whether or not you recompile the kernel, gcc, or anything else.
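To make the byte versus character distinction concrete, here is a minimal sketch in C. The helper name utf8_code_points is hypothetical, and the function assumes its input is valid UTF-8. It counts code points, which is still not the same as user-perceived characters when combining marks are involved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts Unicode code points in a NUL-terminated
   string that is assumed to be valid UTF-8. Continuation bytes look like
   10xxxxxx, so every byte that does not match that pattern starts a new
   code point. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the Euro sign, written as the bytes "\xE2\x82\xAC", strlen reports 3 while this helper reports 1. CHAR_BIT plays no part in either result.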

nos
Yes, so strlen() would say that the Euro sign (3 bytes) has a length of 3 chars, which is not true; it is 1 char long. If I modify CHAR_BIT to 32, will this behavior be corrected?
bobby
No, @bobby. You need to use a Unicode-aware library, like ICU.
Matthew Flaschen
strlen correctly says that your 3-byte euro sign is 3 bytes. strlen doesn't count characters, it counts bytes (conveniently, they're the same for ASCII). You very rarely need to know the number of characters, unless you're writing screen display or layout code. If you do, use wchar_t or a convenient UTF-8 library, as others suggest. Do not ever modify CHAR_BIT. Do not think about modifying your system header files.
nos
Bobby, counting characters is far from trivial, and strlen does not do it, nor does almost any other API you will meet. As an example, the word "שָלוֹם" consists of 4 characters but 6 Unicode code points.
Artyom
While wcslen might indeed not count the number of characters for UTF-16 either, it will for all practical purposes on Linux (until Unicode goes over 32 bits...), as wchar_t is UTF-32 there.
nos
You're right, @nos. All characters will fit in a UTF-32 wchar_t. I should have specified that I meant on Windows.
Matthew Flaschen
+1  A: 

I'm pretty sure that CHAR_BIT is the number of bits in the 'char' variable type, not the maximum number of bits in any character. As you noticed it's a constant in limits.h, which doesn't change based on the locale settings.

CHAR_BIT will equal 8 on any reasonably new / sane system... non-8-bit chars are rare these days :)
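As a small sketch of what CHAR_BIT actually describes, the snippet below reads it rather than redefining it. The function names are hypothetical, and the _Static_assert line assumes a C11 compiler.

```c
#include <limits.h>
#include <stddef.h>

/* The C standard requires CHAR_BIT to be at least 8; this check is
   evaluated at compile time and does nothing at run time. */
_Static_assert(CHAR_BIT >= 8, "CHAR_BIT is at least 8 on any conforming compiler");

/* Hypothetical helpers: sizeof counts in units of char, so multiplying
   by CHAR_BIT gives the width of a type in bits. */
size_t bits_in_char(void)
{
    return sizeof(char) * CHAR_BIT;   /* sizeof(char) is 1 by definition */
}

size_t bits_in_wchar(void)
{
    return sizeof(wchar_t) * CHAR_BIT; /* platform-dependent */
}
```

The macro reports a property of the compiler and hardware. Editing its value in limits.h would only make these functions return wrong numbers; it would not make char any wider.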

Steven Schlansker
`CHAR_BIT` is guaranteed to never be *less* than 8, so it's safe for UTF-8 data.
dan04
+3  A: 

You do not need char to be 32 bits to use UTF-8 encoding. UTF-8 is a variable-length encoding designed around 8-bit code units, and it is backward compatible with ASCII.

You may also use wchar_t, which is 32 bits on Linux, but generally it would not give you much added value, because Unicode processing is much more complicated than just managing code points.

Artyom
I would really like to stay with char rather than use wchar_t.
bobby
So stay with it. UTF-8 is a perfectly fine Unicode encoding.
Artyom
+4  A: 

CHAR_BIT is the number of bits in a char; never, ever change this. It is not going to have the effect you want.

Instead, work with strings of UTF-8 encoded chars, or use strings of wchar_t if you want to store Unicode characters directly.*

* Small print: The size of wchar_t is system-dependent as well. On Windows with MSVC, it's only 16 bits, which is only sufficient for the Basic Multilingual Plane. You can use it with UTF-16, though, which plays nice with the Windows API. On most other systems, wchar_t gives you the full 32 bits.
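If you keep your strings in UTF-8 but occasionally need the code point itself, one option is to decode on demand into a 32-bit integer. The following is a simplified sketch, not a full validator: the name utf8_decode is hypothetical, and it does not reject overlong encodings or surrogate values. It uses uint32_t instead of wchar_t to sidestep the size differences described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the UTF-8 sequence starting at s into *out
   and returns the number of bytes consumed, or 0 if the lead byte or a
   continuation byte is malformed. Simplified: overlong forms and
   surrogates are not rejected. */
size_t utf8_decode(const char *s, uint32_t *out)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len;
    uint32_t cp;

    if (p[0] < 0x80)                { len = 1; cp = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; cp = p[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid byte */

    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;               /* also catches an early NUL terminator */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *out = cp;
    return len;
}
```

Decoding the three bytes of the Euro sign this way yields the single code point U+20AC, while the string itself stays an ordinary array of 8-bit chars.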

Thomas
Not the effect I want? What will the effect be?
bobby
Breaking a whole lot of code that assumes `CHAR_BIT == 8`.
dan04
@bobby: The size of `char` will not change. It will still be 8 bits. The effect will be that code using `CHAR_BIT` for the number of bits in a `char` will break, as dan04 said.
Thomas
A: 

C and C++ define char as a byte, i.e., the integer type for which sizeof returns 1. It doesn't have to be 8 bits, but the overwhelming majority of the time, it is. IMHO, it should have been named byte. But back in 1972 when C was created, Westerners didn't have to deal with multi-byte character encodings, so you could get away with conflating the "character" and "byte" types.

You just have to live with the confusing terminology. Or typedef it away. But don't edit your system header files. If you want a character type instead of a byte type, use wchar_t.

But a UTF-8 string is made of 8-bit code units, so char will work just fine. You just have to remember the distinction between char and character. For example, don't do this:

void make_upper_case(char* pstr)
{
   while (*pstr != '\0')
   {
      *pstr = toupper(*pstr);
      pstr++;
   }
}

toupper('a') works as expected, but toupper('\xC3') is a nonsensical attempt to uppercase half of a character.
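If you only need to uppercase the ASCII letters in a UTF-8 string, a safer sketch is to skip every byte with the high bit set, so multi-byte sequences pass through untouched. The name make_upper_case_ascii is hypothetical, and the code assumes the default "C" locale. Proper case mapping for non-ASCII letters needs a Unicode-aware library.

```c
#include <ctype.h>

/* Hypothetical helper: uppercases only ASCII letters in place. Bytes with
   the high bit set belong to multi-byte UTF-8 sequences and are left
   alone. Converting to unsigned char first also avoids passing a negative
   value to toupper, which is undefined behavior. */
void make_upper_case_ascii(char *pstr)
{
    for (; *pstr != '\0'; pstr++) {
        unsigned char c = (unsigned char)*pstr;
        if (c < 0x80)
            *pstr = (char)toupper(c);
    }
}
```

With this version, "caf\xC3\xA9" becomes "CAF\xC3\xA9": the ASCII letters change, and the two bytes encoding "é" are preserved exactly rather than mangled.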

dan04