I am writing a small app which I need to test with UTF-8 characters of different byte lengths.

I can input Unicode characters that are encoded in UTF-8 with 1, 2 and 3 bytes just fine by doing, for example:

string in = "pi = \u03a0";

But how do I get a Unicode character that is encoded with 4 bytes? I have tried:

string in = "aegan check mark = \u10102";

Which, as far as I understand, should output 𐄂 (U+10102 AEGEAN CHECK MARK). But when I print it out I get ᴶ0

What am I missing?

EDIT:

I got it to work by adding leading zeros:

string in = "\U00010102";

Wish I had thought of that sooner :)

+4  A: 

There's a longer form of escape in the pattern \U followed by eight hex digits, rather than \u followed by four. This is also used in Java and Python, amongst others:

>>> '\xf0\x90\x84\x82'.decode("UTF-8")
u'\U00010102'

However, if you are using byte strings, why not just escape each byte like above, rather than relying on the compiler to convert the escape to a UTF-8 string? This would seem to be more portable as well - if I compile the following program:

#include <iostream>
#include <string>

int main()
{
    std::cout << "narrow: " << std::string("\uFF0E").length() <<
        " utf8: " << std::string("\xEF\xBC\x8E").length() <<
        " wide: " << std::wstring(L"\uFF0E").length() << std::endl;

    std::cout << "narrow: " << std::string("\U00010102").length() <<
        " utf8: " << std::string("\xF0\x90\x84\x82").length() <<
        " wide: " << std::wstring(L"\U00010102").length() << std::endl;
}

On win32 with my current options cl gives:

warning C4566: character represented by universal-character-name '\UD800DD02' cannot be represented in the current code page (932)

The compiler tries to convert all unicode escapes in byte strings to the system code page, which unlike UTF-8 cannot represent all unicode characters. Oddly it has understood that \U00010102 is \uD800\uDD02 in UTF-16 (its internal unicode representation) and mangled the escape in the error message...

When run, the program prints:

narrow: 2 utf8: 3 wide: 1
narrow: 2 utf8: 4 wide: 2

Note that the UTF-8 bytestrings and the wide strings are correct, but the compiler failed to convert "\U00010102", giving the byte string "??", an incorrect result.

gz
Contrary to your second sentence, `\Uxxxxxxxx` is not used in Java: http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.3
T.J. Crowder
A: 

See some examples here

Nemanja Trifunovic