I know that there are already several questions on StackOverflow about std::string versus std::wstring and similar, but none of them proposes a full solution.

In order to obtain a good answer I should define the requirements:

  • multiplatform usage, must work on Windows, OS X and Linux
  • minimal effort for conversion to/from platform-specific Unicode strings like CFStringRef, wchar_t *, or char * as UTF-8, as required by the OS APIs. Remark: I don't need code-page conversion support, because I expect to use only Unicode-compatible functions on all supported operating systems.
  • if an external library is required, it should be open-source and under a very liberal license like BSD, but not LGPL.
  • be able to use a printf-style format syntax or similar.
  • easy string allocation/deallocation.
  • performance is not very important, because I assume the Unicode strings are used only for the application UI.
  • some example code would be appreciated.

I would really appreciate only one proposed solution per answer; this way people can vote for their preferred alternative. If you have more than one alternative, just add another answer.

Please indicate something that has actually worked for you.

+4  A: 

I would strongly recommend using UTF-8 internally in your application, using regular old char* or std::string for data storage. For interfacing with APIs that use a different encoding (ASCII, UTF-16, etc.), I'd recommend using libiconv, which is licensed under the LGPL.

Example usage:

#include <cassert>
#include <cstring>
#include <iconv.h>

class TempWstring
{
public:
  TempWstring(const char *str)
  {
    assert(sUTF8toUTF16 != (iconv_t)-1);
    size_t inBytesLeft = strlen(str);
    size_t outBytesLeft = 2 * (inBytesLeft + 1);  // worst case (2 output bytes per
                                                  // input byte), plus a terminator
    mStr = new char[outBytesLeft];
    char *inBuf = const_cast<char *>(str);  // iconv wants char **, not const char **
    char *outBuf = mStr;
    size_t result = iconv(sUTF8toUTF16, &inBuf, &inBytesLeft, &outBuf, &outBytesLeft);
    assert(result != (size_t)-1 && inBytesLeft == 0);
    outBuf[0] = outBuf[1] = '\0';  // iconv does not null-terminate its output
  }

  ~TempWstring()
  {
    delete [] mStr;
  }

  // Note: the cast assumes a 16-bit wchar_t, i.e. Windows.
  const wchar_t *Str() const { return (wchar_t *)mStr; }

  static void Init()
  {
    sUTF8toUTF16 = iconv_open("UTF-16LE", "UTF-8");  // tocode, fromcode
    assert(sUTF8toUTF16 != (iconv_t)-1);
  }

  static void Shutdown()
  {
    int err = iconv_close(sUTF8toUTF16);
    assert(err == 0);
  }

private:
  char *mStr;

  static iconv_t sUTF8toUTF16;
};

iconv_t TempWstring::sUTF8toUTF16 = (iconv_t)-1;

// At program startup:
TempWstring::Init();

// At program termination:
TempWstring::Shutdown();

// Now, to convert a UTF-8 string to a UTF-16 string, just do this:
TempWstring x("Entr\xc3\xa9""e");  // "Entrée"
const wchar_t *ws = x.Str();  // valid until x goes out of scope

// A less contrived example:
HWND hwnd = CreateWindowW(L"class name",
                          TempWstring("UTF-8 window title").Str(),
                          dwStyle, x, y, width, height, parent, menu, hInstance, lpParam);
Adam Rosenfield
+1, I can't agree more with utf-8 and `std::string`.
avakar
So *every* trivial string operation requires a conversion?
Hans Passant
Your recommendation goes the EXACT opposite way of every OS. Internally, Win/Mac use UTF-16 (because it is fixed size (not really, but for most practical purposes) (really it's UCS-2, but don't tell anybody)), while storage is done in UTF-8.
Martin York
Almost all programs on modern UNIX systems use UTF-8 as internal representations for Unicode strings. (Yes yes, Cocoa likes its UCS-2 but it's not really UNIX.)
ephemient
@Martin York: No, it really is UTF-16, not UCS-2. Windows started as UCS-2, but today most of the stuff is surrogate-aware (I know of one thing that is not, there might be more, but those are bugs; overall the thing is UTF-16).
Mihai Nita
I think it does not go well with the concept of the char type (in C++), since in your solution a "char" no longer stores a single character. Usually UTF-8 (and other variable-size encodings) are used as external encodings, while internally code should use a fixed-size encoding.
Adam Badura
+5  A: 

Same as Adam Rosenfield's answer (+1), but I use UTFCPP instead.
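
For illustration, a round trip with UTFCPP might look roughly like this. This is only a sketch: it assumes the library's iterator-based utf8::utf8to16 / utf8::utf16to8 functions and C++11's std::u16string, and the ToUtf16/ToUtf8 helper names are made up for the example.

#include <iterator>
#include <string>
#include "utf8.h"  // UTFCPP is header-only

// Hypothetical helpers: convert between UTF-8 and UTF-16 strings.
std::u16string ToUtf16(const std::string &utf8)
{
  std::u16string utf16;
  utf8::utf8to16(utf8.begin(), utf8.end(), std::back_inserter(utf16));
  return utf16;
}

std::string ToUtf8(const std::u16string &utf16)
{
  std::string utf8;
  utf8::utf16to8(utf16.begin(), utf16.end(), std::back_inserter(utf8));
  return utf8;
}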

Klaim
+1, interesting library, very idiomatic.
avakar
Which works just as well with std::wstring for internal representation. Take your pick.
Jonas Byström
+1  A: 

I was recently on a cross-platform project that decided to use std::wstring because "wide strings are Unicode, right?" This led to a number of headaches:

  • How big is the scalar value in a wstring? Answer: it's up to the compiler implementation. In Visual Studio (Windows) it is 16 bits, but in Xcode (Mac) it is 32 bits.
  • This led to an unfortunate decision to use UTF-16 for communication over the wire. But which UTF-16? There are two: UTF-16BE (big-endian) and UTF-16LE (little-endian). Not being clear on this led to even more bugs.

When you are in platform-specific code, it makes sense to use the platform's native representation to communicate with its APIs. But for any code that is shared across platforms, or communicates between platforms, avoid all ambiguity and use UTF-8.
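
The first headache, at least, is easy to demonstrate. A minimal check (the sizes in the comment are the usual ones for MSVC versus GCC/Clang; the standard only requires wchar_t to have some implementation-defined width):

#include <cstdio>

int main()
{
  // Typically prints 2 on Windows (MSVC) and 4 on Mac/Linux (GCC/Clang).
  std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
  return 0;
}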

Jon Reid
Which UTF-16 is coming over the wire is easy: you just make sure the BOM is sent as the first character. The receiving layer (the one above transport) then re-arranges the message as required. But I agree, UTF-8 is easier for transport and usually more compact (and transcoding UTF-16 -> UTF-8 is trivial).
Martin York
As with transport on the wire, storage is easier if you use UTF-8.
Martin York
I think that *if* you are using UTF-16 over the wire, you should stick with network endianness, which is big-endian. No need to make any protocol more complex.
Sorin Sbarnea
@Martin, good point -- except they wouldn't have known a BOM if it came up and bit them.
Jon Reid
A: 

Rule of thumb: use the native platform Unicode form for processing (UTF-16 or UTF-32), and UTF-8 for data interchange (communication, storage).

If all the native APIs use UTF-16 (for instance on Windows), having your strings as UTF-8 means you will have to convert all input to UTF-16, call the Win32 API, then convert the answer back to UTF-8. Quite a pain.
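
On Windows, that round trip looks roughly like the following sketch, using the Win32 MultiByteToWideChar/WideCharToMultiByte calls (error handling reduced to a bail-out; the helper names are just for the example):

#include <string>
#include <windows.h>

std::wstring Utf8ToWide(const std::string &s)
{
  // First call computes the required size in wchar_t units, including the NUL.
  int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
  if (n <= 0) return std::wstring();
  std::wstring w(n, L'\0');
  MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
  w.resize(n - 1);  // drop the terminating NUL counted by the -1 length
  return w;
}

std::string WideToUtf8(const std::wstring &w)
{
  int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, NULL, 0, NULL, NULL);
  if (n <= 0) return std::string();
  std::string s(n, '\0');
  WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, &s[0], n, NULL, NULL);
  s.resize(n - 1);
  return s;
}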

But if the main problem is the UI, the strings are the simple part. The more difficult one is the UI framework, and for that I would recommend wxWidgets (http://www.wxWidgets.org): it supports many platforms, is mature (17 years and still very active), and offers native widgets, Unicode support, and a liberal license.

Mihai Nita
A: 

I'd go for a UTF-16 representation in memory, and UTF-8 or UTF-16 on the hard disk or the wire. The main reason: UTF-16 has a fixed size for each "letter". This simplifies a lot of tasks when working with the string (searching, replacing parts, ...).

The only reason for UTF-8 is the reduced memory usage for "western/latin" letters. You can use this representation for disk storage or for transport over the network. It also has the benefit that you need not worry about byte order when loading from or saving to disk/wire.

With these reasons in mind, I'd go for std::wstring internally, or, if your GUI library offers a wide string type, use that (like QString from Qt). For disk storage, I'd write a small platform-independent wrapper around the platform API, or check whether unicode.org has platform-independent code available for this conversion.


For clarification: Korean/Japanese letters are NOT Western/Latin. Japanese uses, for example, Kanji. That's why I mentioned the Latin character set.


Regarding UTF-16 not being 1 character / 2 bytes: this assumption is only true for characters on the Basic Multilingual Plane (see: http://en.wikipedia.org/wiki/UTF16). Still, most users of UTF-16 assume that all characters are on the BMP. If this can't be guaranteed for your application, you can switch to UTF-32 or to UTF-8.
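
To make the BMP caveat concrete, here is a minimal C++11 check (U+1D11E, the musical G clef, lies outside the BMP):

#include <cstdio>

int main()
{
  // U+1D11E is encoded in UTF-16 as a surrogate pair, i.e. two code units.
  const char16_t clef[] = u"\U0001D11E";
  std::printf("UTF-16 code units for U+1D11E: %u\n",
              (unsigned)(sizeof(clef) / sizeof(clef[0]) - 1));  // prints 2
  return 0;
}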

Still, UTF-16 is used, for the reasons mentioned above, in a lot of APIs (e.g. Windows, Qt, Java, .NET, wxWidgets).

Tobias Langner
UTF-16 does not have a fixed size for each letter.
Craig McQueen
UTF-8 has other benefits, such as being able to be processed by the standard C string functions.
Craig McQueen
A propos "reduced memory usage for western/latin letters": things are trickier than they seem. Wikipedia says: "For example both the Japanese and the Korean UTF-8 article on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version".
Carl Seleborg
@Carl Seleborg: Yes, things are indeed trickier. The HTML on Wikipedia has a lot of markup that is plain ASCII; for other formats it might be different. The only way to say what takes more memory is to actually measure. And if some browser takes the HTML from Wikipedia and converts it in memory to UTF-16, because that's how the browser does its job, then the original encoding is irrelevant.
Mihai Nita
@Craig McQueen: "able to be processed by the standard C string functions" is only true in the Unix/Linux/Mac world, and only if you don't forget to set the locale to foo_bar.UTF-8. The Windows C runtime does not handle UTF-8.
Mihai Nita
I used to run with UTF-16 before (UTF-32 on *nix). 'Twas a perfectly unbalanced choice: it does not cope with all cases, and isn't easy to port.
Jonas Byström