Hello,

I want to convert a string to a wide string, with a choice between two behaviors:

  1. Ignore illegal characters
  2. Abort the conversion if an illegal character occurs

On Windows XP I could do this:

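bool ignore_illegal; // input
WCHAR buf[256];      // output buffer (the size is chosen here for illustration)

DWORD flags = ignore_illegal ? 0 : MB_ERR_INVALID_CHARS;

SetLastError(0);

// The last argument is the buffer size in WCHARs, not in bytes
int res = MultiByteToWideChar(CP_UTF8,flags,"test\xFF\xFF test",-1,buf,sizeof(buf)/sizeof(buf[0]));
int err = GetLastError();

std::cout << "result = " << res << " get last error = " << err; 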

Now, on XP, if ignore_illegal is true, I get:

result = 10 get last error = 0

And if ignore_illegal is false, I get:

result = 0 get last error = 1113 // invalid code

So, given a big enough buffer, it is enough to check result != 0.

According to the documentation at http://msdn.microsoft.com/en-us/library/dd319072(VS.85).aspx there are API changes, so how does this behavior change on Vista?

A: 
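WCHAR *pstrRet = NULL;

// First call: ask for the required buffer size in WCHARs (including the terminating null)
int nLen = MultiByteToWideChar(CP_UTF8, 0, pstrTemp2, -1, NULL, 0);

pstrRet = new WCHAR[nLen];

// Second call: perform the actual conversion
int nConv = MultiByteToWideChar(CP_UTF8, 0, pstrTemp2, -1, pstrRet, nLen);

// nLen == 0 means the first call failed, so do not treat 0 == 0 as success
if (nLen > 0 && nConv == nLen)
{
    // Success! pstrRet should be the wide char equivalent of pstrTemp2
}

delete[] pstrRet;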

I think this is the way it is used on Vista; I found it on some forum :)

Arjit
This isn't what I was asking for. I'm asking about error handling in case of invalid characters
Artyom
+2  A: 

I think what it does is replace illegal code units with the replacement character (U+FFFD), which the Unicode standard permits. The following code

#define STRICT
#define UNICODE
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

#include <cstdlib>
#include <iostream>
#include <iomanip>


void test(bool ignore_illegal) {
    const DWORD flags = ignore_illegal ? 0 : MB_ERR_INVALID_CHARS;
    WCHAR buf[0x100];
    SetLastError(0);
    // The last argument is the buffer size in WCHARs, not in bytes
    const int res = MultiByteToWideChar(CP_UTF8, flags, "test\xFF\xFF test", -1, buf, sizeof buf / sizeof buf[0]);
    const DWORD err = GetLastError();
    std::cout << "ignore_illegal = " << std::boolalpha << ignore_illegal
        << ", result = " << std::dec << res
        << ", last error = " << err
        << ", fifth code unit = " << std::hex << static_cast<unsigned int>(buf[5])
        << std::endl;
}


int main() {
    test(false);
    test(true);
    std::system("pause");
}

produces the following output on my Windows 7 system:

ignore_illegal = false, result = 0, last error = 1113, fifth code unit = fffd
ignore_illegal = true, result = 12, last error = 0, fifth code unit = fffd

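So the error codes stay the same, but the length differs by two, which corresponds to the two replacement characters that have been inserted. Note that buf[5] is the code unit at index 5, which here is the second replacement character. If you run my code on XP and the two illegal code units have been dropped, the result is "test test", so buf[4] should be U+0020 (the space character) and buf[5] should be U+0074 ('t').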
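
To make the three possible treatments of illegal input concrete (fail, drop, replace), here is a rough portable sketch. It is not how MultiByteToWideChar is implemented; the names InvalidPolicy and utf8_to_utf16 are made up for illustration, and treating each offending byte as one separate error is an assumption that matches the "\xFF\xFF" case above but may differ from Windows for other malformed sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration; not part of any Windows API.
enum class InvalidPolicy { Fail, Drop, Replace };

// Decodes UTF-8 into UTF-16 code units.
// Returns false only under InvalidPolicy::Fail when ill-formed input is found.
// Assumption: every offending byte counts as one separate error.
bool utf8_to_utf16(const std::string& in, InvalidPolicy policy, std::u16string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // expected sequence length, 0 = invalid lead byte
        char32_t cp = 0;       // decoded code point
        char32_t min = 0;      // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (!ok) {
            if (policy == InvalidPolicy::Fail) return false;
            if (policy == InvalidPolicy::Replace) out.push_back(char16_t(0xFFFD));
            ++i;               // skip one offending byte and resynchronize
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {               // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return true;
}
```

For the input "test\xFF\xFF test", InvalidPolicy::Drop yields 9 code units (10 with the terminating null, like the XP result above) and InvalidPolicy::Replace yields 11 (12 with the terminating null, like the Windows 7 result).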

Philipp
Thanks, that's what I was looking for. Is there any mention of this feature in the documentation?
Artyom
Unfortunately not. The documentation only says that the function "does not drop illegal code points", but not what it does instead. The Unicode standard doesn't define how to treat illegal code unit sequences; it merely requires that they not be interpreted as characters, and that any legal code unit sequence be interpreted as such. So signaling an error, deleting the offending code unit sequences, and replacing them with a replacement character are all legal. I think I'll add a note to the comments of the documentation page.
Philipp
@Philipp Thank you very much once again!
Artyom
@Philipp - hello, I have awarded the bounty. Sorry for the delay; Stack Overflow changed the UI and I thought you only needed to accept the answer rather than click the "+XX" button. Thanks
Artyom