I see that Visual Studio 2008 and later now start off a new solution with the Character Set set to Unicode. My old C++ code deals with only English ASCII text and is full of:

  • Literal strings like "Hello World"
  • char type
  • char * pointers to allocated C strings
  • STL string type
  • Conversions from STL string to C string and vice versa, using the STL string constructor (which accepts const char *) and string::c_str()

    1. What are the changes I need to make to migrate this code so that it works in an ecosystem of Visual Studio Unicode and Unicode-enabled libraries? (I have no real need for it to work with both ASCII and Unicode; it can be pure Unicode.)

    2. Is it also possible to do this in a platform-independent way? (i.e., by not using Microsoft types.)

I see so many wide-character and Unicode types and conversions scattered around; hence my confusion. (e.g., wchar_t, TCHAR, _T, _TEXT, TEXT, etc.)

A: 
  • Wrap your literal constants in _T(), e.g. _T("Hello world")
  • Replace char with the CHAR macro
  • Replace string with wstring

Then all should work.
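
For illustration, here is a minimal sketch of this approach applied consistently (per the comments below, mixing char and wchar_t styles will not work). It relies on Microsoft's <tchar.h>, so it is Windows-specific:

#include <tchar.h>   // _T, TCHAR, _tprintf: all Microsoft-specific
#include <string>

typedef std::basic_string<TCHAR> tstring; // string in ANSI builds, wstring in Unicode builds

int main() {
    tstring greeting = _T("Hello World");    // becomes L"Hello World" when _UNICODE is defined
    _tprintf(_T("%s\n"), greeting.c_str());  // maps to printf or wprintf
    return 0;
}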

Vitaly Dyatlov
Even if migration to Unicode *were* as simple as search-and-replace, using `wstring` for strings, `CHAR` for characters, and string literals that might be either `char` or `wchar_t` will **NOT** work. You've got to be consistent.
dan04
This barely works even on Windows.
Pavel Radzivilovsky
+2  A: 

"Hello World" -> L"Hello World"

char -> wchar_t (unless you actually want char)

char * -> wchar_t *

string -> wstring

These are all platform-independent. However, be aware that the size of a wide character differs between platforms (two bytes on Windows, four bytes on many others).

Define UNICODE and _UNICODE in your project (in Visual Studio you can do this by setting the project's Character Set to Unicode). This also makes the _T, _TEXT, and TEXT macros prepend L to literals automatically, and TCHAR become wchar_t. All of these macros are Microsoft-specific, so avoid them if you want to be cross-platform.
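
To make the mapping concrete, a minimal before/after sketch (the wide versions below are standard C++, subject to the sizeof(wchar_t) caveat above):

#include <string>
#include <cwchar>  // std::wcslen, the wide counterpart of strlen

int main() {
    // const char* msg = "Hello World";   // before
    const wchar_t* msg = L"Hello World";  // after

    // std::string s(msg);                // before
    std::wstring s(msg);                  // after

    return std::wcslen(msg) == s.size() ? 0 : 1;  // wcslen replaces strlen
}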

villintehaspam
+2  A: 

I would suggest not worrying about supporting both ASCII and Unicode builds (à la TCHAR) and going straight to Unicode. That way you get to use more of the platform-independent functions (wcscpy, wcsstr, etc.) instead of relying on the TCHAR functions, which are Microsoft-specific.

You can use std::wstring instead of std::string and replace all chars with wchar_ts. With a massive change like this, I've found it best to start with one thing and let the compiler guide you to the next.

One thing that the compiler won't catch, and that might not show up until run time, is a string allocated with malloc without using the sizeof operator on the underlying type. So watch out for things like char * p = (char*)malloc(11) (ten characters plus the terminating NUL): converted naively, this buffer will be half the size it's supposed to be in wchar_ts. It should become wchar_t * p = (wchar_t*)malloc(11*sizeof(wchar_t)).
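
In compilable form, the pitfall and the fix look something like this (the buffer size is just for illustration):

#include <cstdlib>  // std::malloc, std::free
#include <cwchar>   // std::wcscpy

int main() {
    // Buggy: allocates 11 *bytes*, though 11 wchar_ts need 22 or 44 bytes.
    // wchar_t* bad = (wchar_t*)std::malloc(11);

    // Fixed: scale the element count by the size of the underlying type.
    wchar_t* p = (wchar_t*)std::malloc(11 * sizeof(wchar_t));
    if (p) {
        std::wcscpy(p, L"0123456789");  // 10 characters plus the terminating NUL
        std::free(p);
    }
    return 0;
}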

Oh, and the whole TCHAR business exists to support compile-time selection between ASCII and Unicode strings. It's defined something like this:

#ifdef _UNICODE
#define _T(x) L ## x
#else
#define _T(x) x
#endif

So in the Unicode configuration _T("blah") becomes L"blah", and in the ASCII configuration it's "blah".

Igor Zevaka
Thanks for your useful answer. I have no real need to support both ASCII and Unicode. So, it is full steam into Unicode then :-)
Ashwin
-1: "this string will be half the size it's supposed to be in UNICODE" is false. With wchar_t, characters may be up to 4 bytes, and it depends on the actual content.
Pavel Radzivilovsky
That's an edge case in the UTF-16 encoding that will not apply to text that used to be ASCII. The point I was making was about converting code that assumed 1 byte = 1 character. To get that code working under UCS-2, the assumption that 2 bytes = 1 character is 100% correct.
Igor Zevaka
Changed UNICODE to `wchar_t`.
Igor Zevaka
+7  A: 

I recommend very much against L"", _T(), std::wstring (the latter is not truly multiplatform, since the size of wchar_t varies by platform), and Microsoft's recommendations on how to do Unicode.

There's a lot of confusion on this subject. Some people still think Unicode == 2-byte characters == UTF-16. Neither equality is correct.

In fact, it's possible, and even better, to stay with char*, the plain std::string, and plain literals, and to change very little (and still fully support Unicode!).

See my answer here: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375 for what is (in my opinion) the easiest way to do it.
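
As a taste of that approach on Windows: keep UTF-8 in std::string everywhere and convert to UTF-16 only at the API boundary. MultiByteToWideChar is the actual Win32 conversion function; the widen() helper is just an illustrative name:

#include <windows.h>
#include <string>

// Illustrative helper: UTF-8 std::string -> UTF-16 std::wstring for Win32 calls.
std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring out(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &out[0], n);
    return out;
}

int main() {
    std::string title = "Hello World";  // UTF-8 throughout the program
    MessageBoxW(NULL, widen(title).c_str(), L"Demo", MB_OK);  // convert only here
    return 0;
}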

Pavel Radzivilovsky
A lot of Microsoft's documentation uses the term "Unicode" synonymously with "UTF-16" or "UCS-2".
dreamlax
This is for a reason: when MS first started with internationalization, it was believed that a fixed-width "widechar" encoding was possible.
Pavel Radzivilovsky
@Pavel: if the code is potentially multiplatform then this might make sense, but surely for a Windows program the Win32 and MFC support for UTF16 works well and handles a lot of the UTF encoding issues pretty painlessly.
AAT
@AAT: I disagree wrt MS support of UTF-16. For instance, when you try to delete a 4-byte UTF-16 character in Notepad, the text becomes invalid. What I suggest is converting to UTF-16 only near MFC/API calls. At least, I program only for Windows, and after these hassles I prefer UTF-8 even there.
Pavel Radzivilovsky
The "widechar" approach is possible, but only if you use a non-variable-length encoding, such that one widechar is always one character. If your characters are Unicode code points, UCS-2 won't cut it - your widechars have to be UCS-4.
caf
@caf: I was under the impression that the keyword "widechar" refers specifically to two bytes, and therefore has no useful meaning anymore.
Pavel Radzivilovsky
Pavel: Only in Windows-land. In UNIX it's common for `wchar_t` to be 4 bytes, and store UCS-4 encoded characters.
caf
@caf: which is probably the reason why UNIX people like UTF-8 even more :)
Pavel Radzivilovsky
The whole point of `wchar_t` was to have a type that can represent any character. When Unicode was expanded from 16 bits to 21, the UNIX world switched to a 32-bit type so that `wchar_t` would still comply with the standard. Windows kept it 16 bits for backwards compatibility.
dan04
In Windows-land, yes. But Windows is a legitimate OS, and if wchar_t is not good there, it is not good for multiplatform code, period.
Pavel Radzivilovsky
A: 

Your question involves two different but related concepts. One of them is the encoding of the string (Unicode/ASCII, for example). The other is the data type to be used for the character representation.

Technically, you can have a Unicode application using plain char and std::string. You could use escape sequences in hexadecimal ("\xC3\xA9") or octal ("\303\251") form to spell out the byte sequence of a string. Notice that with this approach your existing string literals that contain only ASCII characters should remain valid, since Unicode preserves the ASCII codes.

One important point to observe is that many string-related functions need to be used carefully, because they operate on bytes rather than characters. For example, std::string::operator[] might give you a particular byte that is only part of a Unicode character.
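
A quick illustration (the two bytes \xC3\xA9 are the UTF-8 encoding of 'é'):

#include <iostream>
#include <string>

int main() {
    std::string s = "caf\xC3\xA9";   // "café" stored as UTF-8: five bytes, four characters
    std::cout << s.size() << '\n';   // prints 5: bytes, not characters
    // s[3] is 0xC3, the first byte of 'é', not a complete character.
    return 0;
}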

In Visual Studio, wchar_t was chosen as the underlying character type. So if you're working with Microsoft-based libraries, things should get easier for you if you follow much of the advice posted by others here: replacing char with wchar_t, using the "T" macros (if you want to preserve transparency between Unicode and non-Unicode builds), etc.

However, I don't think there is a de facto standard for working with Unicode across libraries, since each might have a different strategy for handling it.

ltcmelo
The main problem is that Microsoft APIs do not support wchar_t properly either. It is known that in a Windows textbox you cannot delete some characters with a single backspace if their encoding takes more than two wchars. Also: Unicode and ASCII are not encodings.
Pavel Radzivilovsky
Well, maybe you're using a different meaning for the word *encoding*, but ASCII, UTF-8, UTF-16, and others are actually character encodings. Regarding the rest of your comments... I don't see how any of the comments I made could somehow conflict with them. They are just additional info.
ltcmelo
A: 
paercebal