I see that Visual Studio 2008 and later now start off a new solution with the Character Set set to Unicode. My old C++ code deals with only English ASCII text and is full of:

  • Literal strings like "Hello World"
  • char type
  • char * pointers to allocated C strings
  • STL string type
  • Conversions from STL string to C string and vice versa, using the STL string constructor (which accepts const char *) and string::c_str()

    1. What are the changes I need to make to migrate this code so that it works in an ecosystem of Visual Studio Unicode and Unicode-enabled libraries? (I have no real need for it to work with both ASCII and Unicode; it can be pure Unicode.)

    2. Is it also possible to do this in a platform-independent way? (i.e., by not using Microsoft types.)

I see so many wide-character and Unicode types and conversions scattered around; hence my confusion. (e.g., wchar_t, TCHAR, _T, _TEXT, TEXT, etc.)

A: 
  • Wrap your literal constants in _T(), e.g. _T("Hello world")
  • Replace char with the CHAR macro
  • Replace string with wstring

Then all should work.
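
For illustration, here is a minimal sketch of this approach applied consistently (per the comments below, mixing char and wchar_t styles will not work). It relies on Microsoft's <tchar.h>, so it is Windows-specific:

#include <tchar.h>   // _T, TCHAR, _tprintf: all Microsoft-specific
#include <string>

typedef std::basic_string<TCHAR> tstring; // string in ANSI builds, wstring in Unicode builds

int main() {
    tstring greeting = _T("Hello World");    // becomes L"Hello World" when _UNICODE is defined
    _tprintf(_T("%s\n"), greeting.c_str());  // maps to printf or wprintf
    return 0;
}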

Vitaly Dyatlov
Even if migration to Unicode *were* as simple as search-and-replace, using `wstring` for strings, `CHAR` for characters, and string literals that might be either `char` or `wchar_t` will **NOT** work. You've got to be consistent.
dan04
This barely works even on Windows.
Pavel Radzivilovsky
+2  A: 

"Hello World" -> L"Hello World"

char -> wchar_t (unless you actually want char)

char * -> wchar_t *

string -> wstring

These are all platform-independent. However, be aware that the size of a wide character differs between platforms (two bytes on Windows, four bytes on many others).

Define UNICODE and _UNICODE in your project (in Visual Studio you can do this by setting the project's Character Set to Unicode). This also makes the _T, _TEXT, and TEXT macros prepend L to literals automatically, and TCHAR become wchar_t. All of these macros are Microsoft-specific, so avoid them if you want to be cross-platform.
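
To make the mapping concrete, a minimal before/after sketch (the wide versions below are standard C++, subject to the sizeof(wchar_t) caveat above):

#include <string>
#include <cwchar>  // std::wcslen, the wide counterpart of strlen

int main() {
    // const char* msg = "Hello World";   // before
    const wchar_t* msg = L"Hello World";  // after

    // std::string s(msg);                // before
    std::wstring s(msg);                  // after

    return std::wcslen(msg) == s.size() ? 0 : 1;  // wcslen replaces strlen
}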

villintehaspam
+2  A: 

I would suggest not worrying about supporting both ASCII and Unicode builds (à la TCHAR) and going straight to Unicode. That way you get to use more of the platform-independent functions (wcscpy, wcsstr, etc.) instead of relying on the TCHAR functions, which are Microsoft-specific.

You can use std::wstring instead of std::string and replace all chars with wchar_ts. With a massive change like this, I've found it best to start with one thing and let the compiler guide you to the next.

One thing that the compiler won't catch, and that might not show up until run time, is a string allocated with malloc without using the sizeof operator on the underlying type. So watch out for things like char * p = (char*)malloc(11) (ten characters plus the terminating NUL): converted naively, this buffer will be half the size it's supposed to be in wchar_ts. It should become wchar_t * p = (wchar_t*)malloc(11*sizeof(wchar_t)).
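
In compilable form, the pitfall and the fix look something like this (the buffer size is just for illustration):

#include <cstdlib>  // std::malloc, std::free
#include <cwchar>   // std::wcscpy

int main() {
    // Buggy: allocates 11 *bytes*, though 11 wchar_ts need 22 or 44 bytes.
    // wchar_t* bad = (wchar_t*)std::malloc(11);

    // Fixed: scale the element count by the size of the underlying type.
    wchar_t* p = (wchar_t*)std::malloc(11 * sizeof(wchar_t));
    if (p) {
        std::wcscpy(p, L"0123456789");  // 10 characters plus the terminating NUL
        std::free(p);
    }
    return 0;
}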

Oh, and the whole TCHAR business exists to support compile-time selection between ASCII and Unicode strings. It's defined something like this:

#ifdef _UNICODE
#define _T(x) L ## x
#else
#define _T(x) x
#endif

So in the Unicode configuration _T("blah") becomes L"blah", and in the ASCII configuration it's "blah".

Igor Zevaka
Thanks for your useful answer. I have no real need to support both ASCII and Unicode. So, it is full steam into Unicode then :-)
Ashwin
-1: "this string will be half the size it's supposed to be in UNICODE" is false. With wchar_t, characters may be up to 4 bytes, and it depends on the actual content.
Pavel Radzivilovsky
That's an edge case in the UTF-16 encoding that will not apply to text that used to be ASCII. The point I was making was about converting code that assumed 1 byte = 1 character. To get that code working under UCS-2, the assumption that 2 bytes = 1 character is 100% correct.
Igor Zevaka
Changed UNICODE to `wchar_t`.
Igor Zevaka
+7  A: 

I recommend very much against L"", _T(), std::wstring (the latter is not truly multiplatform, since the size of wchar_t varies by platform), and Microsoft's recommendations on how to do Unicode.

There's a lot of confusion on this subject. Some people still think Unicode == 2-byte characters == UTF-16. Neither equality is correct.

In fact, it's possible, and even better, to stay with char*, the plain std::string, and plain literals, and to change very little (and still fully support Unicode!).

See my answer here: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375 for what is (in my opinion) the easiest way to do it.
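
As a taste of that approach on Windows: keep UTF-8 in std::string everywhere and convert to UTF-16 only at the API boundary. MultiByteToWideChar is the actual Win32 conversion function; the widen() helper is just an illustrative name:

#include <windows.h>
#include <string>

// Illustrative helper: UTF-8 std::string -> UTF-16 std::wstring for Win32 calls.
std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring out(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &out[0], n);
    return out;
}

int main() {
    std::string title = "Hello World";  // UTF-8 throughout the program
    MessageBoxW(NULL, widen(title).c_str(), L"Demo", MB_OK);  // convert only here
    return 0;
}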

Pavel Radzivilovsky
A lot of Microsoft's documentation uses the term "Unicode" synonymously with "UTF-16" or "UCS-2".
dreamlax
This is for a reason: when MS first started with internationalization, it was believed that a fixed-width "widechar" encoding was possible.
Pavel Radzivilovsky
@Pavel: if the code is potentially multiplatform then this might make sense, but surely for a Windows program the Win32 and MFC support for UTF16 works well and handles a lot of the UTF encoding issues pretty painlessly.
AAT
@AAT: I disagree wrt MS support of UTF-16. For instance, when you try to delete a 4-byte UTF-16 character in Notepad, the text becomes invalid. What I suggest is converting to UTF-16 only near MFC/API calls. At least, I program only for Windows, and after these hassles I prefer UTF-8 even there.
Pavel Radzivilovsky
The "widechar" approach is possible, but only if you use a non-variable-length encoding, such that one widechar is always one character. If your characters are Unicode code points, UCS-2 won't cut it - your widechars have to be UCS-4.
caf
@caf: I was under the impression that the keyword "widechar" refers specifically to two bytes, and therefore has no useful meaning anymore.
Pavel Radzivilovsky
Pavel: Only in Windows-land. In UNIX it's common for `wchar_t` to be 4 bytes, and store UCS-4 encoded characters.
caf
@caf: which is probably the reason why UNIX people like UTF-8 even more :)
Pavel Radzivilovsky
The whole point of `wchar_t` was to have a type that can represent any character. When Unicode was expanded from 16 bits to 21, the UNIX world switched to a 32-bit type so that `wchar_t` would still comply with the standard. Windows kept it 16 bits for backwards compatibility.
dan04
In Windows-land, yes. But Windows is a legitimate OS, and if wchar_t is not good there, it is not good for multiplatform code, period.
Pavel Radzivilovsky
A: 

Your question involves two different but related concepts. One of them is the encoding of the string (Unicode/ASCII, for example). The other is the data type to be used for the character representation.

Technically, you can have a Unicode application using plain char and std::string. You could use escape sequences in hexadecimal ("\xC3\xA9") or octal ("\303\251") form to spell out the byte sequence of a string. Notice that with this approach your existing string literals that contain only ASCII characters should remain valid, since Unicode preserves the ASCII codes.

One important point to observe is that many string-related functions need to be used carefully, because they operate on bytes rather than characters. For example, std::string::operator[] might give you a particular byte that is only part of a Unicode character.
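
A quick illustration (the two bytes \xC3\xA9 are the UTF-8 encoding of 'é'):

#include <iostream>
#include <string>

int main() {
    std::string s = "caf\xC3\xA9";   // "café" stored as UTF-8: five bytes, four characters
    std::cout << s.size() << '\n';   // prints 5: bytes, not characters
    // s[3] is 0xC3, the first byte of 'é', not a complete character.
    return 0;
}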

In Visual Studio, wchar_t was chosen as the underlying character type. So if you're working with Microsoft-based libraries, things should get easier for you if you follow much of the advice posted by others here: replacing char with wchar_t, using the "T" macros (if you want to preserve transparency between Unicode and non-Unicode builds), etc.

However, I don't think there is a de facto standard for working with Unicode across libraries, since each might have a different strategy for handling it.

ltcmelo
The main problem is that Microsoft APIs do not support wchar_t properly either. It is known that in a Windows textbox you cannot delete some characters with a single backspace if their encoding takes more than two wchars. Also: Unicode and ASCII are not encodings.
Pavel Radzivilovsky
Well, maybe you're using a different meaning for the word *encoding*, but ASCII, UTF-8, UTF-16, and others are actually character encodings. Regarding the rest of your comments... I don't see how any of the comments I made could somehow conflict with them. They are just additional info.
ltcmelo
A: 
paercebal