Hi all,

I'm using Visual Studio 2008 (C++). How do I create a CString (in a non-Unicode app) from a byte array that has a string encoded in UTF8 in it?

Thanks,

kreb

EDIT: Clarification: I guess what I'm asking is.. CStringA doesn't seem to be able to interpret a UTF8 string as UTF8, but rather as ASCII or the current codepage (I think).. How do I convert this UTF8 string to a CStringW? (UTF-16..?) Thanks

A: 

The nice thing about UTF8 is that a NUL-terminated UTF8 string is also a valid NUL-terminated C string, because UTF8 never produces a zero byte except to encode NUL itself. That means that, provided the byte array is NUL-terminated, you should be able to simply cast a pointer to its first byte to (const char *) and pass it to CString like you would any NUL-terminated C string.

Note that unless CString is aware of UTF8 semantics (I'm not familiar enough with CString to know exactly how it works, but I suspect it isn't), certain operations that make sense on an ASCII C string may give strange results for a UTF8 C string. For example, a Reverse() method that reversed the order of the bytes in the string would not do the right thing for a UTF8 string, because it would not know to keep the bytes of each multi-byte character together in their original order, and would reverse them along with everything else.
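To make the byte-reversal problem concrete, here is a minimal sketch. It uses std::string instead of CString (an assumption, so that it compiles without MFC/ATL), and the helper name ReverseBytes is hypothetical.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: reverses the raw bytes of a string, the way a
// byte-oriented Reverse() with no knowledge of UTF8 would.
std::string ReverseBytes(const std::string& s)
{
    std::string r(s);
    std::reverse(r.begin(), r.end());
    return r;
}
```

For the UTF8 bytes of "aé" (0x61 0xC3 0xA9) this returns 0xA9 0xC3 0x61. The result now starts with a continuation byte, so it is no longer well-formed UTF8, even though the same function is harmless on pure ASCII input.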

Jeremy Friesner
A: 

For most things, you can treat UTF8 the same as ASCII.

unsigned char szUtf8String[nSize] = "whatever";
// unsigned char* to char* needs reinterpret_cast; static_cast will not compile
CString s = reinterpret_cast<const char *>(szUtf8String);

That works for manipulating the string and writing it to a file. However, you cannot easily display it: the bytes will be interpreted as ASCII (or the local code page), so any non-English characters will come out wrong.

To display it, you will need to convert to UTF16 and possibly then back to ANSI (in the local code page).

Michael J
Thanks, how do I do this..?
krebstar
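On Windows, you can use MultiByteToWideChar() and WideCharToMultiByte(). On any platform you can use mbstowcs() and wcstombs() and other related functions. The former give more control, but the latter are standard C++ and available on any platform. Note that mbstowcs() and wcstombs() convert using the current C locale's encoding, so they only help here if that locale is UTF8.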
Michael J
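To show what the UTF8 to UTF16 conversion actually does to the bytes, here is a hand-rolled sketch in portable C++. It is a stand-in for illustration only, not the Windows API mentioned above. The helper name Utf8ToUtf16 is hypothetical, and the code assumes well-formed input and does no error checking.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes well-formed UTF8 into UTF16 code units.
// Assumption: the input is valid UTF8; malformed bytes are not detected.
std::vector<unsigned short> Utf8ToUtf16(const std::string& utf8)
{
    std::vector<unsigned short> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        unsigned long cp;   // code point being assembled
        std::size_t extra;  // number of continuation bytes that follow
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < utf8.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        if (cp < 0x10000) {
            out.push_back(static_cast<unsigned short>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<unsigned short>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<unsigned short>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Given the UTF8 bytes E4 B8 AD E6 96 87 (the characters 中文), this yields the two UTF16 code units 0x4E2D and 0x6587. In a real Windows program the API call is the better choice, since it also handles invalid input.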
A: 

CStringW filename = CA2W(null_terminated_byte_buffer, CP_UTF8); should do the trick.

MSN
Thanks I'm gonna try this..
krebstar
Does this work in non-unicode apps? Doesn't seem to work.. =/ I think I'd need to use a unicode version of CFile as well.. How do I get one from a non-Unicode app?
krebstar
Please elaborate on "doesn't seem to work".
MSN
Sorry, I did this and the CString in the debugger still shows it as if it was interpreted with the local code page, that is, no change. Anyway, I tried to open a file (CFile) with this CStringW as filename but it's still that string interpreted in the local code page.. =/
krebstar
I think it's failing like this because I am opening the file with CW2A(filename).. and thus converting it back into UTF8.. Is there a way to just use the unicode versions of these functions without having to port the whole app?
krebstar
You can use CreateFileW.
MSN
Quick question.. If I have a statement like "CStringW filename = L"中文";" I can hover over the filename variable and it displays the text correctly... However if I do "CStringW filename = CA2W((LPCTSTR)buffer, CP_UTF8);" and I hover over the filename and buffer variables, they show the incorrectly interpreted text.. What is going on? It's like CA2W didn't do anything at all.. Could this mean my buffer isn't in UTF8?
krebstar
It's certainly possible. What's the byte array (in hex preferably)? You should also probably be casting to an LPCSTR since CA2W stands for ANSI to Unicode.
MSN
The byte array looks like this:
[0x3] 0x7f ''  unsigned char
[0x4] 0x33 '3' unsigned char
[0x5] 0x73 's' unsigned char
[0x6] 0x68 'h' unsigned char
[0x7] 0x7f ''  unsigned char
[0x8] 0x35 '5' unsigned char
[0x9] 0x4c 'L' unsigned char
[0xa] 0x36 '6' unsigned char
[0xb] 0x2e '.' unsigned char
[0xc] 0x70 'p' unsigned char
[0xd] 0x64 'd' unsigned char
[0xe] 0x66 'f' unsigned char
The actual Chinese text is this: 中文.pdf
I will try that cast that you mentioned..
krebstar
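The dump above can be checked mechanically. UTF8 encodes every non-ASCII character using only bytes of 0x80 and above, so a buffer whose bytes are all below 0x80 cannot hold Chinese text in UTF8. A minimal sketch of that check follows; the helper name HasNonAsciiByte is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if any byte in the buffer is 0x80 or
// above. UTF8 text containing non-ASCII characters must have such bytes.
bool HasNonAsciiByte(const unsigned char* data, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] >= 0x80)
            return true;
    }
    return false;
}
```

Every byte shown in the dump is below 0x80, so the check returns false for them. A UTF8 encoding of 中文.pdf would start with E4 B8 AD E6 96 87 and return true.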
Sorry, I seem to have wasted your time.. =/ The string, in fact, does not contain a UTF8 string, but rather a string encoded by proprietary software I am working with.. =/ I was misled because the field was suffixed with UTF8.. =/
krebstar
Well when it is UTF8 you'll know what to do :)
MSN