I have the lovely functions from my previous question, which work fine if I do this:

wstring temp;
wcin >> temp;

string whatever( toUTF8(getSomeWString()) );

// store whatever, copy, but do not use it as UTF8 (see below)

wcout << toUTF16(whatever) << endl;

The original form is reproduced, but the in-between form often contains extra characters. If I enter for example àçé as input and add a `cout << whatever` statement, I'll get ┬à┬ç┬é as output.

Can I still use this string to compare to others procured from an ASCII source? Or, asked differently: if I were to output ┬à┬ç┬é through the UTF-8 cout on Linux, would it read àçé? Is the byte content of a string àçé, read in UTF-8 Linux by cin, exactly the same as what the Win32 API gets me?

Thanks!

PS: the reason I'm asking is that I need to use the string a lot to compare against other read values (comparing and concatenating...).

+1  A: 

When you convert the string to UTF-16, each code unit is 16 bits wide; you can't compare it directly to ASCII values because those are 8-bit values. You have to convert one side to compare, or write a specialized comparison-to-ASCII function.
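For illustration, a minimal sketch of such a specialized comparison (the name `equalsAscii` is mine, not from the thread); it relies on Unicode code points below 128 coinciding with ASCII:

#include <string>

// Compare a UTF-16 string to a 7-bit ASCII string code unit by code unit.
// Only valid because Unicode code points below 128 match ASCII.
bool equalsAscii(const std::wstring& wide, const std::string& ascii) {
    if (wide.size() != ascii.size())
        return false;
    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        if (wide[i] != static_cast<wchar_t>(static_cast<unsigned char>(ascii[i])))
            return false;
    }
    return true;
}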

I doubt the UTF-8 cout on Linux would produce the same correct output unless the values are plain ASCII, since the UTF-8 encoding form is binary-compatible with ASCII only for code points below 128, and I assume UTF-16 diverges from UTF-8 in a similar fashion.
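A quick illustration of that compatibility (the byte values can be checked against any Unicode table):

const char ascii_a = 'A';               // 0x41 in ASCII
const char utf8_a[] = "\x41";           // the same single byte in UTF-8
const char utf8_agrave[] = "\xC3\xA0";  // U+00E0 ('à') takes two bytes in UTF-8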

The good news is that there are many converters out there for transforming strings between character sets.

James
I know about conversions (I use them in the previous question I linked to, and I'm converting exactly because I need to perform comparisons), and I'm trying to establish whether a Win32 API converted string is the same as a raw UTF-8 (Linux) string. `cout` on Linux outputs the characters nicely; that's why it uses UTF-8 in the first place (well, probably for lots of other reasons too). The thing is, I don't know if the `┬à┬ç┬é` string is also present in a raw UTF-8 string.
rubenvb
+2  A: 

Let me start by saying that there appears to be simply no way to output UTF-8 text to the console in Windows via cout (assuming you compile with Visual Studio). What you can do for your tests, however, is output your UTF-8 text via the Win32 API function WriteConsoleA:

#include <windows.h>
#include <cstring>
#include <iostream>

int main() {
    // Switch the console output code page to UTF-8 (65001).
    if(!SetConsoleOutputCP(CP_UTF8)) {
        std::cerr << "Failed to set console output code page!\n";
        return 1;
    }
    HANDLE const consout = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD nNumberOfCharsWritten;
    // UTF-8 encoded: 'Ä' (U+00C4) and 'ü' (U+00FC).
    const char* utf8 = "Umlaut AE = \xC3\x84 / ue = \xC3\xBC\n";
    if(!WriteConsoleA(consout, utf8, static_cast<DWORD>(std::strlen(utf8)),
                      &nNumberOfCharsWritten, NULL)) {
        DWORD const err = GetLastError();
        std::cerr << "WriteConsole failed with error " << err << "!\n";
        return 1;
    }
    return 0;
}

This should output Umlaut AE = Ä / ue = ü, provided you set your console (cmd.exe) to use the Lucida Console font.

As for your question (taken from your comment) whether

a Win32 API converted string is the same as a raw UTF-8 (Linux) string

I will say yes: given a Unicode character sequence, its UTF-16 (Windows wchar_t) representation converted to a UTF-8 (char) representation via the WideCharToMultiByte function will always yield the same byte sequence.
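To make that concrete, here is a minimal sketch of such a conversion (assuming the question's toUTF8 helper looks roughly like this; the exact implementation is an assumption):

#include <windows.h>
#include <string>

// Convert a UTF-16 string to UTF-8 using the usual two-call pattern.
std::string toUTF8(const std::wstring& wide) {
    if (wide.empty())
        return std::string();
    // First call: ask for the required buffer size in bytes.
    int size = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                                   static_cast<int>(wide.size()),
                                   NULL, 0, NULL, NULL);
    std::string utf8(size, '\0');
    // Second call: perform the actual conversion.
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                        static_cast<int>(wide.size()),
                        &utf8[0], size, NULL, NULL);
    return utf8;
}

Note that the last two arguments must stay NULL for CP_UTF8; WideCharToMultiByte fails otherwise.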

Martin
If you are using `WriteConsole` anyway, you can also use `WriteConsoleW` to write the UTF-16 string directly, eliminating the necessity for `SetConsoleOutputCP`.
Philipp
@Philipp - Yes, first converting from UTF-16 to UTF-8 and then using `WriteConsoleA` makes little sense. If the strings in the (test) app are already UTF-8, though, it might still make sense.
Martin
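For completeness, a minimal sketch of the `WriteConsoleW` route Philipp describes (assumed usage, not code from the thread):

#include <windows.h>
#include <cwchar>

int main() {
    // Write a UTF-16 string directly; no SetConsoleOutputCP call is needed.
    HANDLE const consout = GetStdHandle(STD_OUTPUT_HANDLE);
    const wchar_t* utf16 = L"Umlaut AE = \u00C4 / ue = \u00FC\n";
    DWORD written;
    WriteConsoleW(consout, utf16,
                  static_cast<DWORD>(std::wcslen(utf16)), &written, NULL);
    return 0;
}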