ansaurus

Question

Answer 1

+1 A:

UTF-8 encodes 1 character in several bytes.

Also, do not edit your system header files. (And no, modifying CHAR_BIT will not work, nor will recompiling the kernel, gcc, or whatnot.)

nos 2010-04-18 18:16:17

Yes, so strlen() would say that the Euro sign (3 bytes) has a length of 3 chars, which is not true; it is 1 char long. If I modify CHAR_BIT to 32, will this behavior be corrected?

bobby 2010-04-18 18:23:10

No, @bobby. You need to use a Unicode-aware library, like ICU.

Matthew Flaschen 2010-04-18 18:32:00

strlen correctly says that your 3-byte euro sign is 3 bytes. strlen doesn't count characters, it counts bytes (conveniently, they're the same for ASCII though). You very rarely need to know the number of characters, unless you're writing screen display or layout code. Use wchar_t if you do, or a convenient UTF-8 library, as others suggest. Do not ever modify CHAR_BIT. Do not think about modifying your system header files.

nos 2010-04-18 18:38:41
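
To make the byte-versus-character distinction concrete, here is a minimal sketch that counts code points in a UTF-8 string by skipping continuation bytes. The function name `utf8_code_points` is hypothetical, and the sketch assumes the input is already valid UTF-8 (it does not validate anything). It counts code points only, which is still not the same as user-perceived characters once combining marks are involved.

```c
#include <stddef.h>

/* Hypothetical helper: counts Unicode code points in a NUL-terminated
   UTF-8 string. Assumes valid UTF-8; performs no validation. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Continuation bytes have the form 10xxxxxx; every other byte
           starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the euro sign encoded as the bytes E2 82 AC, strlen reports 3 while this function reports 1.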

Bobby, counting characters is far from trivial, and strlen does not do it, nor does almost any other API you will meet. As an example, the word "שָלוֹם" consists of 4 characters but 6 Unicode code points.

Artyom 2010-04-18 18:51:17

While wcslen might indeed not count the number of characters for UTF-16 either, it will for all practical purposes on Linux (until Unicode goes over 32 bits...), since wchar_t holds UTF-32 there.

nos 2010-04-18 19:00:53

You're right, @nos. All characters will fit in a UTF-32 wchar_t. I should have specified that I meant on Windows.

Matthew Flaschen 2010-04-18 19:01:38

Answer 2

+1 A:

I'm pretty sure that CHAR_BIT is the number of bits in the 'char' variable type, not the maximum number of bits in any character. As you noticed it's a constant in limits.h, which doesn't change based on the locale settings.

CHAR_BIT will equal 8 on any reasonably new / sane system... non-8-bit chars are rare these days :)

Steven Schlansker 2010-04-18 18:17:35

`CHAR_BIT` is guaranteed to never be *less* than 8, so it's safe for UTF-8 data.

dan04 2010-04-18 19:10:56
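
Rather than editing limits.h, code can simply read CHAR_BIT. The following is a small sketch, assuming a C11 compiler for `_Static_assert`; the helper name `bits_in` is hypothetical.

```c
#include <limits.h>
#include <stddef.h>

/* C11 compile-time check: the standard guarantees at least 8 bits per char. */
_Static_assert(CHAR_BIT >= 8, "CHAR_BIT must be at least 8");

/* Hypothetical helper: number of bits in an object of the given byte size.
   sizeof(char) is 1 by definition, so bits_in(sizeof(char)) is CHAR_BIT. */
size_t bits_in(size_t size_in_bytes)
{
    return size_in_bytes * CHAR_BIT;
}
```

Because CHAR_BIT only describes what the compiler already does, changing the macro's text would not change the size of char.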

Answer 3

+3 A:

You do not need char to be 32 bits to use the UTF-8 encoding. UTF-8 is a variable-length encoding designed around 8-bit units, and it is backward compatible with ASCII.

You may also use wchar_t, which is 32 bits on Linux, but generally it would not give you much added value, because Unicode processing is much more complicated than just managing code points.

Artyom 2010-04-18 18:18:03

I would really like to stay with char rather than use wchar_t.

bobby 2010-04-18 18:26:07

So stay with it. UTF-8 is a perfectly fine Unicode encoding.

Artyom 2010-04-18 18:34:49

Answer 4

+4 A:

CHAR_BIT is the number of bits in a char; never, ever change this. It is not going to have the effect you want.

Instead, work with strings of UTF-8 encoded chars, or use strings of wchar_t if you want to store Unicode characters directly.*

* Small print: The size of wchar_t is system-dependent as well. On Windows with MSVC, it's only 16 bits, which is only sufficient for the Basic Multilingual Plane. You can use it with UTF-16, though, which plays nice with the Windows API. On most other systems, wchar_t gives you the full 32 bits.

Thomas 2010-04-18 18:18:06
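
Since the width of wchar_t varies by platform, a program can check it instead of assuming. This is a rough sketch; both helper names are hypothetical, and it relies only on `WCHAR_MAX` from `<wchar.h>` and `CHAR_BIT` from `<limits.h>`.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: storage width of wchar_t in bits on this platform. */
size_t wchar_t_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: nonzero if a single wchar_t can hold any code point
   up to U+10FFFF, including those outside the Basic Multilingual Plane. */
int wchar_t_holds_all_code_points(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

Where the second function returns 0, as would be expected with a 16-bit wchar_t, characters outside the Basic Multilingual Plane need two UTF-16 code units.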

Not the effect I want? What will the effect be?

bobby 2010-04-18 18:26:39

Breaking a whole lot of code that assumes `CHAR_BIT == 8`.

dan04 2010-04-18 19:05:55

@bobby: The size of `char` will not change. It will still be 8 bits. The effect will be that code using `CHAR_BIT` for the number of bits in a `char` will break, as dan04 said.

Thomas 2010-04-18 19:49:54

Answer 5

A:

C and C++ define char as a byte, i.e., the integer type for which sizeof returns 1. It doesn't have to be 8 bits, but the overwhelming majority of the time, it is. IMHO, it should have been named byte. But back in 1972 when C was created, Westerners didn't have to deal with multi-byte character encodings, so you could get away with conflating the "character" and "byte" types.

You just have to live with the confusing terminology. Or typedef it away. But don't edit your system header files. If you want a character type instead of a byte type, use wchar_t.

But a UTF-8 string is made of 8-bit code units, so char will work just fine. You just have to remember the distinction between char and character. For example, don't do this:

void make_upper_case(char* pstr)
{
   while (*pstr != '\0')
   {
      *pstr = toupper(*pstr);
      pstr++;
   }
}

toupper('a') works as expected, but toupper('\xC3') is a nonsensical attempt to uppercase half of a character.

dan04 2010-04-18 19:04:55
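
One possible way to avoid uppercasing half of a character is to touch only ASCII bytes and leave every byte of a multi-byte sequence alone. This is a sketch, not a full Unicode case conversion: the name `make_upper_case_ascii` is hypothetical, and non-ASCII letters such as é are deliberately left unchanged. A Unicode-aware library would be needed to convert those.

```c
#include <ctype.h>

/* Hypothetical helper: uppercases only ASCII letters in a UTF-8 string.
   Bytes >= 0x80 belong to multi-byte sequences and are left untouched. */
void make_upper_case_ascii(char *pstr)
{
    for (; *pstr != '\0'; pstr++) {
        /* Convert to unsigned char first: passing a negative char value
           to toupper is undefined behavior. */
        unsigned char c = (unsigned char)*pstr;
        if (c < 0x80)
            *pstr = (char)toupper(c);
    }
}
```

Applied to the UTF-8 bytes of "café", this yields "CAFé": the two bytes encoding é pass through unmodified.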

gcc, UTF-8 and limits.h
