I know that there are already several questions on StackOverflow about std::string versus std::wstring and similar, but none of them proposes a full solution.

In order to obtain a good answer I should define the requirements:

  • multiplatform usage, must work on Windows, OS X and Linux
  • minimal effort for conversion to/from platform-specific Unicode strings like CFStringRef, wchar_t *, or char * as UTF-8, as required by the OS APIs. Remark: I don't need code-page conversion support, because I expect to use only Unicode-compatible functions on all supported operating systems.
  • if an external library is required, it should be open-source and under a very liberal license like BSD, but not LGPL.
  • be able to use a printf-style format syntax or similar.
  • easy string allocation/deallocation.
  • performance is not very important, because I assume the Unicode strings are used only for the application UI.
  • some example code would be appreciated.

I would really appreciate only one proposed solution per answer; this way people can vote for their preferred alternative. If you have more than one alternative, just add another answer.

Please indicate something that has actually worked for you.

+4  A: 

I would strongly recommend using UTF-8 internally in your application, using regular old char* or std::string for data storage. For interfacing with APIs that use a different encoding (ASCII, UTF-16, etc.), I'd recommend using libiconv, which is licensed under the LGPL.

Example usage:

#include <cassert>
#include <cstring>
#include <iconv.h>

class TempWstring
{
public:
  TempWstring(const char *str)
  {
    assert(sUTF8toUTF16 != (iconv_t)-1);
    size_t inBytesLeft = strlen(str);
    size_t outBytesLeft = 2 * (inBytesLeft + 1);  // worst case (2 output bytes per
                                                  // input byte), plus a terminator
    mStr = new char[outBytesLeft];
    char *inBuf = const_cast<char *>(str);  // iconv wants char **, not const char **
    char *outBuf = mStr;
    size_t result = iconv(sUTF8toUTF16, &inBuf, &inBytesLeft, &outBuf, &outBytesLeft);
    assert(result != (size_t)-1 && inBytesLeft == 0);
    outBuf[0] = outBuf[1] = '\0';  // iconv does not null-terminate its output
  }

  ~TempWstring()
  {
    delete [] mStr;
  }

  // Note: the cast assumes a 16-bit wchar_t, i.e. Windows.
  const wchar_t *Str() const { return (wchar_t *)mStr; }

  static void Init()
  {
    sUTF8toUTF16 = iconv_open("UTF-16LE", "UTF-8");  // tocode, fromcode
    assert(sUTF8toUTF16 != (iconv_t)-1);
  }

  static void Shutdown()
  {
    int err = iconv_close(sUTF8toUTF16);
    assert(err == 0);
  }

private:
  char *mStr;

  static iconv_t sUTF8toUTF16;
};

iconv_t TempWstring::sUTF8toUTF16 = (iconv_t)-1;

// At program startup:
TempWstring::Init();

// At program termination:
TempWstring::Shutdown();

// Now, to convert a UTF-8 string to a UTF-16 string, just do this:
TempWstring x("Entr\xc3\xa9""e");  // "Entrée"
const wchar_t *ws = x.Str();  // valid until x goes out of scope

// A less contrived example:
HWND hwnd = CreateWindowW(L"class name",
                          TempWstring("UTF-8 window title").Str(),
                          dwStyle, x, y, width, height, parent, menu, hInstance, lpParam);
Adam Rosenfield
+1, I can't agree more with utf-8 and `std::string`.
avakar
So *every* trivial string operation requires a conversion?
Hans Passant
Your recommendation goes the EXACT opposite way of every OS. Internally, Win/Mac use UTF-16 (because it is fixed size (not really, but for most practical purposes) (really it's UCS-2, but don't tell anybody)), while storage is done in UTF-8.
Martin York
Almost all programs on modern UNIX systems use UTF-8 as internal representations for Unicode strings. (Yes yes, Cocoa likes its UCS-2 but it's not really UNIX.)
ephemient
@Martin York: No, it really is UTF-16, not UCS-2. Windows started as UCS-2, but today most of the stuff is surrogate-aware (I know of one thing that is not, there might be more, but those are bugs; overall the thing is UTF-16).
Mihai Nita
I think it does not go well with the concept of the char type (in C++), since in your solution a "char" no longer stores a single character. Usually UTF-8 (and other variable-size encodings) are used as external encodings, while internally code should use a fixed-size encoding.
Adam Badura
+5  A: 

Same as Adam Rosenfield's answer (+1), but I use UTFCPP instead.
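
For illustration, a round trip with UTFCPP might look roughly like this. This is only a sketch: it assumes the library's iterator-based utf8::utf8to16 / utf8::utf16to8 functions and C++11's std::u16string, and the ToUtf16/ToUtf8 helper names are made up for the example.

#include <iterator>
#include <string>
#include "utf8.h"  // UTFCPP is header-only

// Hypothetical helpers: convert between UTF-8 and UTF-16 strings.
std::u16string ToUtf16(const std::string &utf8)
{
  std::u16string utf16;
  utf8::utf8to16(utf8.begin(), utf8.end(), std::back_inserter(utf16));
  return utf16;
}

std::string ToUtf8(const std::u16string &utf16)
{
  std::string utf8;
  utf8::utf16to8(utf16.begin(), utf16.end(), std::back_inserter(utf8));
  return utf8;
}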

Klaim
+1, interesting library, very idiomatic.
avakar
Which works just as well with std::wstring for internal representation. Take your pick.
Jonas Byström
+1  A: 

I was recently on a cross-platform project that decided to use std::wstring because "wide strings are Unicode, right?" This led to a number of headaches:

  • How big is the scalar value in a wstring? Answer: it's up to the compiler implementation. In Visual Studio (Windows) it is 16 bits, but in Xcode (Mac) it is 32 bits.
  • This led to an unfortunate decision to use UTF-16 for communication over the wire. But which UTF-16? There are two: UTF-16BE (big-endian) and UTF-16LE (little-endian). Not being clear on this led to even more bugs.

When you are in platform-specific code, it makes sense to use the platform's native representation to communicate with its APIs. But for any code that is shared across platforms, or communicates between platforms, avoid all ambiguity and use UTF-8.
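
The first headache, at least, is easy to demonstrate. A minimal check (the sizes in the comment are the usual ones for MSVC versus GCC/Clang; the standard only requires wchar_t to have some implementation-defined width):

#include <cstdio>

int main()
{
  // Typically prints 2 on Windows (MSVC) and 4 on Mac/Linux (GCC/Clang).
  std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
  return 0;
}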

Jon Reid
Which UTF-16 is coming over the wire is easy: you just make sure the BOM is sent as the first character. The receiving layer (the one above transport) then re-arranges the message as required. But I agree, UTF-8 is easier for transport and usually more compact (and transcoding UTF-16 -> UTF-8 is trivial).
Martin York
As with transport on the wire, storage is easier if you use UTF-8.
Martin York
I think that *if* you are using UTF-16 over the wire, you should stick with network endianness, which is big-endian. No need to make any protocol more complex.
Sorin Sbarnea
@Martin, good point -- except they wouldn't have known a BOM if it came up and bit them.
Jon Reid
A: 

Rule of thumb: use the native platform Unicode form for processing (UTF-16 or UTF-32), and UTF-8 for data interchange (communication, storage).

If all the native APIs use UTF-16 (for instance on Windows), having your strings as UTF-8 means you will have to convert all input to UTF-16, call the Win32 API, then convert the answer back to UTF-8. Quite a pain.
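
On Windows, that round trip looks roughly like the following sketch, using the Win32 MultiByteToWideChar/WideCharToMultiByte calls (error handling reduced to a bail-out; the helper names are just for the example):

#include <string>
#include <windows.h>

std::wstring Utf8ToWide(const std::string &s)
{
  // First call computes the required size in wchar_t units, including the NUL.
  int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
  if (n <= 0) return std::wstring();
  std::wstring w(n, L'\0');
  MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
  w.resize(n - 1);  // drop the terminating NUL counted by the -1 length
  return w;
}

std::string WideToUtf8(const std::wstring &w)
{
  int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, NULL, 0, NULL, NULL);
  if (n <= 0) return std::string();
  std::string s(n, '\0');
  WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, &s[0], n, NULL, NULL);
  s.resize(n - 1);
  return s;
}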

But if the main problem is the UI, the strings are the simple part. The more difficult one is the UI framework, and for that I would recommend wxWidgets (http://www.wxWidgets.org): it supports many platforms, is mature (17 years and still very active), and offers native widgets, Unicode support, and a liberal license.

Mihai Nita
A: 

I'd go for a UTF-16 representation in memory, and UTF-8 or UTF-16 on the hard disk or the wire. The main reason: UTF-16 has a fixed size for each "letter". This simplifies a lot of tasks when working with the string (searching, replacing parts, ...).

The only reason for UTF-8 is the reduced memory usage for "western/latin" letters. You can use this representation for disk storage or for transport over the network. It also has the benefit that you need not worry about byte order when loading from or saving to disk/wire.

With these reasons in mind, I'd go for std::wstring internally, or, if your GUI library offers a wide string type, use that (like QString from Qt). For disk storage, I'd write a small platform-independent wrapper around the platform API, or check whether unicode.org has platform-independent code available for this conversion.


For clarification: Korean/Japanese letters are NOT Western/Latin. Japanese uses, for example, Kanji. That's why I mentioned the Latin character set.


Regarding UTF-16 not being 1 character / 2 bytes: this assumption is only true for characters on the Basic Multilingual Plane (see: http://en.wikipedia.org/wiki/UTF16). Still, most users of UTF-16 assume that all characters are on the BMP. If this can't be guaranteed for your application, you can switch to UTF-32 or to UTF-8.
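
To make the BMP caveat concrete, here is a minimal C++11 check (U+1D11E, the musical G clef, lies outside the BMP):

#include <cstdio>

int main()
{
  // U+1D11E is encoded in UTF-16 as a surrogate pair, i.e. two code units.
  const char16_t clef[] = u"\U0001D11E";
  std::printf("UTF-16 code units for U+1D11E: %u\n",
              (unsigned)(sizeof(clef) / sizeof(clef[0]) - 1));  // prints 2
  return 0;
}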

Still, UTF-16 is used, for the reasons mentioned above, in a lot of APIs (e.g. Windows, Qt, Java, .NET, wxWidgets).

Tobias Langner
UTF-16 does not have a fixed size for each letter.
Craig McQueen
UTF-8 has other benefits, such as being able to be processed by the standard C string functions.
Craig McQueen
A propos "reduced memory usage for western/latin letters": things are trickier than they seem. Wikipedia says: "For example both the Japanese and the Korean UTF-8 article on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version".
Carl Seleborg
@Carl Seleborg: Yes, things are indeed trickier. The HTML on Wikipedia has a lot of markup that is plain ASCII; for other formats it might be different. The only way to say what takes more memory is to actually measure. And if some browser takes the HTML from Wikipedia and converts it in memory to UTF-16, because that's how the browser does its job, then the original encoding is irrelevant.
Mihai Nita
@Craig McQueen: "able to be processed by the standard C string functions" is only true in the Unix/Linux/Mac world, and only if you don't forget to set the locale to foo_bar.UTF-8. The Windows C runtime does not handle UTF-8.
Mihai Nita
I used to run with UTF-16 before (UTF-32 on *nix). 'Twas a perfectly unbalanced choice: it does not cope with all cases, and isn't easy to port.
Jonas Byström