ansaurus

Question

C++ unicode UTF-16 encoding

Answer 1

A:

Just use \x instead of \u.

dan04 2010-04-21 02:48:02

I just store it in a string. If it uses \x, I can of course change it to \u. But how can I convert it? Thanks

Dan 2010-04-21 03:46:30

Answer 2

+2 A:

Embedding Unicode in string literals is generally not a good idea and is not portable; there is no guarantee that wchar_t will be 16 bits wide or that the encoding will be UTF-16. While this is the case on Windows with Microsoft Visual C++ (one particular C++ implementation), wchar_t is 32 bits with GCC on OS X (another implementation). If you have localized string constants, it is best to keep them in a configuration file in a known encoding and to decode them explicitly as that encoding. The International Components for Unicode (ICU) library provides pretty good support for interpreting and handling Unicode. Another good library for converting between (but not interpreting) encodings is libiconv.
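As a small illustration of the size issue, here is a sketch that assumes a C++11 or later compiler (newer than this answer). With such a compiler, char16_t and the u"..." prefix give UTF-16 code units no matter how wide wchar_t is. The function names wchar_bits and sample_utf16 are made up for this example.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: width of wchar_t in bits on this implementation.
// Typically 16 with Visual C++ on Windows and 32 with GCC on Linux or OS X.
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// A UTF-16 literal (C++11 and later): each element is one UTF-16 code unit,
// independent of the width of wchar_t.
inline std::u16string sample_utf16() {
    return u"hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875";
}
```

Here sample_utf16() holds 14 code units (8 ASCII characters plus 6 Chinese characters), whatever wchar_bits() reports on the platform.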

Edit
It is possible I am misinterpreting your question... if the problem is that you already have a string in UTF-16, and you want to convert it to "unicode-escape ASCII" (i.e. an ASCII string where non-ASCII characters are represented by "\u" followed by the hexadecimal value of the character), then use the following pseudo-code:

for each codepoint represented by the UTF-16 encoded string:
    if the codepoint is in the range [0,0x7F]:
       emit the codepoint cast to a char
    else:
       emit "\u" followed by the hexadecimal digits representing codepoint

Now, to get the codepoint, there is a very simple rule... each element in the UTF-16 string is a codepoint, unless it is part of a "surrogate pair", in which case it and the element after it together make up a single codepoint. In that case, the Unicode standard defines a procedure for combining the "leading surrogate" (0xD800 to 0xDBFF) and the "trailing surrogate" (0xDC00 to 0xDFFF) into a single code point. Note that UTF-8 and UTF-16 are both variable-length encodings... a fixed-width encoding such as UTF-32 needs 32 bits per code point. The Unicode Transformation Format (UTF) FAQ explains the encoding as well as how to identify surrogate pairs and how to combine them into codepoints.
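The pseudo-code and the surrogate rule above could be sketched in C++ roughly as follows. This is only an illustration, not a tested library routine. It assumes a C++11 compiler (for std::u16string), the name escape_utf16 is made up, and two choices are my own assumptions: code points above U+FFFF are written as \U plus eight hex digits, and an unpaired surrogate is escaped as it stands.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Sketch only: turns UTF-16 code units into ASCII with \u escapes.
// Assumptions (not from the answer): code points above U+FFFF become
// \UXXXXXXXX, and an unpaired surrogate is escaped unchanged.
std::string escape_utf16(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // A leading surrogate followed by a trailing surrogate forms one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t trail = in[i + 1];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (trail - 0xDC00);
                ++i;  // the trailing surrogate has been consumed
            }
        }
        if (cp <= 0x7F) {
            out += static_cast<char>(cp);  // plain ASCII passes through
        } else {
            char buf[11];  // backslash, 'U', 8 hex digits, terminating NUL
            if (cp <= 0xFFFF) {
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
            } else {
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
            }
            out += buf;
        }
    }
    return out;
}
```

On Windows, where wchar_t holds UTF-16 code units, a std::wstring w could be handed over as escape_utf16(std::u16string(w.begin(), w.end())).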

Michael Aaron Safyan 2010-04-21 02:48:20

But it's a requirement and I have no choice; the app will only run on Windows. Can anyone give me an example of how to convert it? By the way, the ICU site is not accessible here. Thanks

Dan 2010-04-21 03:44:35

@Dan, if you use the L"hao123--\x6211\x7684\x4E0A\x7F51\x4E3B\x9875" on Windows, then it should be a const wchar_t* string, and it should be encoded in UTF-16... you will have to figure out, though, whether it is UTF-16LE or UTF-16BE (i.e. whether it is little-endian or big-endian). I suspect it will be little-endian, but you will have to try it. I don't use Windows (I'm a *NIX guy, and I am not too fond of Microsoft for its intentional non-compliance with IEEE Std. 1003.1 as well as its intentional non-compliance with ISO C99 and other standards), so you will have to try it on your system...

Michael Aaron Safyan 2010-04-21 03:52:11

@Dan, ... if you cast the const wchar_t* to a const char*, and then print out each byte, individually, as a hexadecimal number, what do you get? If you share that, then it should be easier to answer your question.
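A minimal sketch of such a byte dump, with hex_dump as a made-up helper name. It reads the bytes through unsigned char, because a plain (signed) char holding a byte of 0x80 or above is sign-extended when passed to printf("%x") and prints as something like ffffff84.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Sketch only: formats a buffer as space-separated two-digit hex bytes.
// Reading through unsigned char avoids sign extension of bytes >= 0x80.
std::string hex_dump(const void* data, std::size_t size) {
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < size; ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

For a wide literal, the call would look like hex_dump(L"hao123", 6 * sizeof(wchar_t)). If the low byte of each character comes first (68 00 61 00 ...), the data is little-endian.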

Michael Aaron Safyan 2010-04-21 03:53:09

@Dan, also, what do you mean you don't have a choice? There are other reasons to prefer a configuration file... for example, it makes it possible to change the localization or translation without recompiling the entire program... surely your boss can be persuaded by sound logic on the merits of that approach, no?

Michael Aaron Safyan 2010-04-21 03:54:48

Because another piece of software must consume data in this format. The printf result is: 68 0 61 0 6f 0 31 0 32 0 33 0 2d 0 2d 0 11 62 ffffff84 76 a 4e 51 7f 3b 4e 75 ffffff98 0

Dan 2010-04-21 04:57:29

Thanks for your advice. I have decided to write a function to process it myself.

Dan 2010-04-21 05:14:33
