Hello All,

At my company we have a cross-platform (Linux & Windows) library that contains our own extension of the STL std::string. This class provides all sorts of functionality on top of the string: split, format, to/from base64, etc. Recently we were given the requirement of making this string Unicode "friendly"; basically, it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side, since everything is inherently UTF-8. However, I am having trouble with the Windows side: is there a trick to getting the STL std::string to work as UTF-8 on Windows? Is it even possible? Is there a better way? Ideally we would stay based on std::string, since that is what the string class is based on in Linux.

Thank you,

+2  A: 

Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.

Mark B
+6  A: 

Putting UTF-8 encoded text into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8; it expects and works with UTF-16 instead. You can switch to an std::wstring, which will store UTF-16 (at least on most Windows compilers), or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).

Jerry Coffin
I tried using wstring; however, the application seems to be unable to render the Unicode characters I was testing with, "大夨天太夫", so I'm not sure what to make of that. Is there some special Windows "voodoo" that I need to work in order to get wstring to work?
NSA
@NSA, you must select a font that includes the characters you wish to display. Very few fonts have a large portion of the Unicode code points covered.
Mark Ransom
@NSA - ensure that you have "Eastern languages support" enabled in Control Panel -> Regional and Language settings. Also you may be using a font that lacks these characters.
atzz
@NSA: it depends. If you try to use `cout` or `wcout`, it's pretty much a disaster. If you pass the contents of a `wstring` directly to a Windows function, things are much simpler (`printf` and such work pretty well also). From there, it's mostly a matter of ensuring that the font you use can display all the characters you care about.
Jerry Coffin
+5  A: 

There are several misconceptions in your question.

  • Neither C++ nor the STL know anything about encodings.

  • std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.

  • Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
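To illustrate the bytes-versus-characters point above, here is a minimal sketch. The helper name utf8_code_point_count is hypothetical, and it assumes the string already holds valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// ASSUMED to hold valid UTF-8. Continuation bytes look like 10xxxxxx;
// every other byte starts a new code point, so we count those.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the two-character string "大天" (bytes E5 A4 A7 E5 A4 A9), length() reports 6 while this helper reports 2. Note that a code point is still not necessarily what a user perceives as one character (combining marks, for instance).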

Thomas
+1  A: 

No, there is no way to make Windows treat "narrow" strings as UTF-8.

Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).

  • Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
  • In the Windows portion of the code, use the "wide" versions of the Windows API explicitly, e.g. write CreateFileW instead of CreateFile. This avoids a dependency on the build system configuration.
  • In the platform abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
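As a rough illustration of the conversion step in the last bullet, here is a portable, hand-written sketch rather than the MultiByteToWideChar call itself, so that it can be compiled on either platform. The function name utf8_to_utf16 and the use of std::u16string are assumptions for illustration, and validation is deliberately minimal.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Truncated sequences and bad continuation bytes throw; overlong forms
// and encoded surrogates are NOT rejected in this sketch.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) { cp = lead; extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

In real Windows code the API functions named above would normally do this work; the sketch only shows what the conversion involves. Since wchar_t is 16 bits on Windows, the resulting code units could be copied into a std::wstring before being passed to a "W" function.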

Other approaches that I tried but don't like much:

  • typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and tstring, but it still adds a lot of pain.
  • Use std::wstring everywhere. Does not help much since wchar_t is 16 bits on Windows, so you either have to restrict yourself to the BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
  • Use ATL/WTL/MFC CString in the platform-specific portion; use std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and thus is not always acceptable or convenient.
atzz
Using std::wstring does not restrict you to just the BMP. The full range of Unicode codepoints can be encoded in UTF-16, using surrogates where needed, and std::wstring can hold a UTF-16 encoded string just fine.
Remy Lebeau - TeamB
@Remy - sure. That's what I meant by "or go to a lot of complications to make the code dealing with Unicode cross-platform". On Linux, wchar_t can hold an entire code point; on Windows, it can't. So you have to use conditional compilation and stuff. And you don't have the nice property of "one cell == one char" anymore. So why not just use UTF-8?
atzz
Try std::basic_string<int16_t> (or similar) to force a UTF-16 encoded string on all platforms without relying on the byte size of wchar_t. Also, you don't have a "one cell = one char" guarantee in UTF-8, as UTF-8 encodes a Unicode codepoint using between 1 and 4 code units, whereas UTF-16 uses only 1 or 2 code units. So if anything, UTF-16 can sometimes be easier to work with than UTF-8. The main benefit of UTF-8 is backwards compatibility with ASCII. For codepoints outside of ASCII, you have to deal with Unicode encodings, and for codepoints from U+0800 to U+FFFF, UTF-8 uses more storage space than UTF-16.
Remy Lebeau - TeamB
@Remy - I never implied that there is a "one cell = one char" guarantee in UTF-8. Please read more carefully. Using std::basic_string<int16_t> will bring the disadvantages of UTF16 handling to all platforms; why do it if you don't have to? Besides, it won't work with std::streams on Windows (on some compilers at least).
atzz
+2  A: 

If you want to avoid headaches, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required on Windows, and it supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.

Philipp
+1  A: 

In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.

I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:

  • Use UTF-8 as the default encoding for strings.
  • In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.

This is also the approach Poco has taken.

dan04
+3  A: 

Yes - by being more aware of locales and encodings.

Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings; the *A() functions take strings in the current code page. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there; Googling "Unicode Windows tutorial" should get you some.

If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing this), and then pass it to Windows.

Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as the signedness of char is implementation-defined. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type of uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
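A minimal sketch of one way to sidestep the signedness pitfall just described, assuming the data stays in a std::string: convert each code unit to unsigned char before comparing. The helper name is_ascii_code_unit is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the byte at index i is a single-byte
// (ASCII) code unit. Converting to unsigned char first makes the
// comparison behave the same whether plain char is signed or unsigned.
// The caller is assumed to pass an index within the string's bounds.
bool is_ascii_code_unit(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]) < 0x80;
}
```

With a plain s[i] < 0x80 on a platform where char is signed, the bytes C3 and A9 (the UTF-8 encoding of "é") would both compare as negative values and so pass the test incorrectly.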

Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:

which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales

In Linux, a wchar_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said, does not conform to the above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)

The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::strings, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather that I know my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.

Thanatos