ansaurus

Question

Answer 1

+2 A:

The problem is you specified the WC_ERR_INVALID_CHARS flag:

Windows Vista and later: Fail if an invalid input character is encountered. If this flag is not set, the function silently drops illegal code points. A call to GetLastError returns ERROR_NO_UNICODE_TRANSLATION. Note that this flag only applies when CodePage is specified as CP_UTF8 or 54936 (for Windows Vista and later). It cannot be used with other code page values.

Your conversion function seems quite long. How does this one work for you?

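//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF16ToUTF8
// DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA ConvertUTF16ToUTF8( __in LPCWSTR pszTextUTF16 ) {
    if (pszTextUTF16 == NULL) return "";

    // Length in WCHARs, excluding the null terminator.
    int utf16len = static_cast<int>(wcslen(pszTextUTF16));
    if (utf16len == 0) return "";   // the API rejects a zero input length

    // First call: ask how many bytes the UTF-8 output needs.
    int utf8len = WideCharToMultiByte(CP_UTF8, 0, pszTextUTF16, utf16len,
        NULL, 0, NULL, NULL );
    if (utf8len == 0) return "";    // conversion failed; see GetLastError()

    CArray<CHAR> buffer;
    buffer.SetSize(utf8len+1);
    buffer.SetAt(utf8len, '\0');    // terminate manually; no terminator is written

    // Second call: perform the conversion into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, pszTextUTF16, utf16len,
        buffer.GetData(), utf8len, NULL, NULL );

    return buffer.GetData();
}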

I see you use a function called StringCchLengthW to get the required length of the output buffer. Most sources I've seen recommend calling WideCharToMultiByte itself, with a zero-sized output buffer, to find out how many CHARs it needs.
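The two-call sizing pattern described above can be sketched without ATL or MFC, using std::wstring and std::string. This is only an illustration under stated assumptions: the helper name Utf16ToUtf8 and its flags parameter are hypothetical, the #ifdef guard exists only so the file still compiles on other platforms, and error handling is reduced to throwing an exception.

```cpp
#ifdef _WIN32   // Windows-only sketch; the guard just lets the file compile elsewhere
#include <windows.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the thread): UTF-16 -> UTF-8 via the Windows API.
// Pass flags = WC_ERR_INVALID_CHARS (Vista and later) to fail on invalid input
// instead of having it silently dropped or replaced.
std::string Utf16ToUtf8(const std::wstring& text, DWORD flags = 0) {
    if (text.empty()) return std::string();   // the API rejects a zero input length

    const int utf16len = static_cast<int>(text.size());

    // First call: no output buffer, size 0 => returns the required byte count.
    const int utf8len = WideCharToMultiByte(CP_UTF8, flags, text.data(), utf16len,
                                            NULL, 0, NULL, NULL);
    if (utf8len == 0) throw std::runtime_error("WideCharToMultiByte failed (sizing)");

    // Second call: convert into a buffer of exactly that size.
    std::string utf8(static_cast<size_t>(utf8len), '\0');
    const int written = WideCharToMultiByte(CP_UTF8, flags, text.data(), utf16len,
                                            &utf8[0], utf8len, NULL, NULL);
    if (written != utf8len) throw std::runtime_error("WideCharToMultiByte failed (convert)");
    return utf8;
}
#endif
```

Because an explicit input length is passed, the function does not write a null terminator; std::string keeps track of the length itself.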

Edit:
As Rob pointed out, you can use CW2A with the CP_UTF8 code page:

CStringA str = CW2A(wStr, CP_UTF8);

While I'm editing, I can answer your second question:

How can I verify the resultant UTF-8 string is correct?

Write it to a text file, then open it in Mozilla Firefox or an equivalent program. In the View menu, you can go to Character Encoding and switch manually to UTF-8 (assuming Firefox didn't guess it correctly to begin with). Compare it with a UTF-16 document containing the same text and see if there are any differences.
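A programmatic check can complement the visual one. The sketch below is not from the thread; IsValidUtf8 is a hypothetical helper that only tests whether a byte string is well-formed UTF-8 (no truncated sequences, overlong forms, encoded surrogates, or values above U+10FFFF). It cannot tell whether the text is the text you intended, so comparing against a known-good document is still worthwhile.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool IsValidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                       // stray continuation or invalid lead byte

        if (i + len > s.size()) return false;    // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        // Smallest code point that legitimately needs `len` bytes.
        static const unsigned long kMin[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < kMin[len]) return false;                 // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // encoded surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        i += len;
    }
    return true;
}
```

A lone 0xE9 byte (Latin-1 "é"), for instance, is rejected, which is a common sign that a string was converted with an ANSI code page rather than CP_UTF8.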

Gunslinger47 2010-06-21 08:06:44

Thanks. But don't you need to account for the null terminator (\0) when computing int utf16len = wcslen(pszTextUTF16); ?

Yan Cheng CHEOK 2010-06-21 08:24:53

And I thought I had to decide on WC_ERR_INVALID_CHARS at compile time. It seems that is not the case.

Yan Cheng CHEOK 2010-06-21 08:31:18

The utf16len variable is used to tell the function how many WCHARs to read. It doesn't need to read the null terminator.

Gunslinger47 2010-06-21 08:50:53

Answer 2

+2 A:

You can also use the ATL text conversion macros - to convert from UTF-16 to UTF-8 use CW2A and pass CP_UTF8 as the code page, e.g.:

CW2A utf8(buffer, CP_UTF8);
const char* data = utf8.m_psz;

Rob 2010-06-21 08:19:50

+1 I knew about those macros but I didn't know you could give code pages as a second parameter.

Gunslinger47 2010-06-21 08:28:15

Sorry. I don't want to use ATL. My code will be standard C++. (For the example I gave, I will change it to use wstring and string instead of CString.)

Yan Cheng CHEOK 2010-06-21 08:32:11

@Yan Cheng CHEOK: There's no standard C++ way to convert text encoding. You're going to use the Windows API anyway. And you don't need to use all of ATL to use this macro.

Amnon 2010-06-21 09:15:30

C'mon, turning UTF-16 into UTF-8 is the most trivial conversion in the world. You can do that in 30 lines of code. Don't try it in 20, because that will break on surrogates.

MSalters 2010-06-21 09:53:28
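A hand-written conversion along the lines of the comment above might look like the following sketch. It is not MSalters' code: the name Utf16ToUtf8Portable, the use of std::u16string as input, and the choice to throw on an unpaired surrogate are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: portable UTF-16 -> UTF-8 without any Windows API.
// Throws std::invalid_argument on an unpaired surrogate.
std::string Utf16ToUtf8Portable(const std::u16string& in) {
    std::string out;
    out.reserve(in.size() * 3);   // one UTF-16 code unit yields at most 3 UTF-8 bytes
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {                // high surrogate
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            // Combine the pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;                                           // consumed the low surrogate
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }

        if (cp < 0x80) {                                   // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                           // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                         // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                           // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The surrogate branch is the part a shorter version tends to skip: without it, each half of a pair would be encoded as its own 3-byte sequence, which is not valid UTF-8.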


Convert UTF-16 to UTF-8
