My code is basically this:

wstring japan = L"日本";
wstring message = L"Welcome! Japan is ";

message += japan;

wprintf(message.c_str());

I want to use wide strings, but I don't know how they are output, so I used wprintf. When I run something such as:

./widestr | hexdump

The hexadecimal output looks like this:

65 57 63 6c 6d 6f 21 65 4a 20 70 61 6e 61 69 20 20 73 3f 3f
e  W  c  l  m  o  !  e  J     p  a  n  a  i        s  ?  ?

Why are they all jumbled in order? I mean, even if the wprintf is wrong, I still don't get why it would output in such a specific jumbled order!

edit: endianness or something? They seem to be swapped in pairs of two characters. Huh.

EDIT 2: I tried using wcout, but it produces exactly the same hexadecimal output. Weird!
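
(A likely explanation for the pair-wise swap: with no options, `hexdump` displays the input as 16-bit words in the machine's native little-endian byte order, so every two bytes appear reversed even though the program wrote them in order; `hexdump -C` shows the bytes as written. A minimal sketch that reproduces that display for the first eight bytes:)

    // Sketch: reproduce hexdump's default display of the bytes "Welcome!".
    // With no options, hexdump prints 16-bit little-endian words, so each
    // pair of bytes is shown low byte first ("eW", "cl", "mo", "!e").
    #include <cstdio>

    int main()
    {
        const char text[] = "Welcome!";   // the bytes as actually written to the pipe
        for (int i = 0; i + 1 < 8; i += 2)
        {
            // combine two consecutive bytes into one little-endian 16-bit word
            unsigned word = (unsigned char)text[i] | ((unsigned char)text[i + 1] << 8);
            std::printf("%04x ", word);   // prints: 6557 636c 6d6f 2165
        }
        std::printf("\n");
    }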

A: 

wstring japan = L"日本";

is not a valid way to define a UTF-16 literal. Before C++0x, there is no way to do that. The compiler would only parse ASCII characters stored in the file, no matter what the encoding of the source file is (which may or may not be UTF-16).

In C++0x, there is a whole bunch of Unicode features for that.

In C++03 you will, unfortunately, need to load it from an external string table, or write the code points as hex escapes: L"\x1234\x1234", etc.
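
For illustration, a minimal sketch of both approaches (日 is U+65E5 and 本 is U+672C; the variable names are just examples, and the Unicode literals are the C++0x features as they were standardized in C++11):

    #include <string>

    // C++03: spell the code points with hex escapes instead of putting raw
    // characters in the source file (U+65E5 = 日, U+672C = 本). Assumes the
    // wide execution character set is Unicode-based.
    std::wstring japan_cxx03 = L"\x65e5\x672c";

    // C++0x / C++11: dedicated Unicode string literals and character types.
    std::u16string japan_u16 = u"\u65e5\u672c";   // char16_t, UTF-16
    std::u32string japan_u32 = U"\u65e5\u672c";   // char32_t, UTF-32
    std::string    japan_u8  = u8"\u65e5\u672c";  // UTF-8 bytes (char in C++11; char8_t since C++20)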

Pavel Radzivilovsky
This is not correct: GCC does the right thing, and so does MSVC (but it needs a BOM to know that the sources are UTF-8).
Artyom
Who said anything about UTF-16? ;) There isn't even any reference to Unicode in the question other than the tag. We don't actually know what character encoding the source file is in or what encoding the execution wide-character set is using (no information about the platform at all). It's perfectly feasible that what's posted might work, but it's totally environment dependent.
Charles Bailey
environmentally dependent == shouldn't be used. Widechar == utf-16.
Pavel Radzivilovsky
@Pavel widechar==utf-16 only on Windows... If you don't care about that one... it is fine storage for code points.
Artyom
@Pavel Radzivilovsky: "Widechar == utf-16" Nope. Widechar == UTF-32 on Linux: 4 bytes per character. I.e. sizeof(wchar_t) is platform/implementation-dependent.
SigTerm
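
(A trivial check of the point above about wchar_t's width; typical results are 2 with MSVC on Windows and 4 with GCC/Clang on Linux:)

    #include <iostream>

    int main()
    {
        // sizeof(wchar_t) is implementation-defined: commonly 2 bytes
        // (UTF-16 code units) on Windows and 4 bytes (UTF-32) on Linux.
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    }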
+3  A: 

You need to set the global locale:

    #include <stdio.h>
    #include <string>
    #include <locale>
    #include <iostream>

    using namespace std;

    int main()
    {
        // Make the user's environment locale the global C++ locale; this also
        // sets the C locale, so both wprintf and wcout are affected.
        std::locale::global(std::locale(""));

        wstring japan = L"日本";
        wstring message = L"Welcome! Japan is ";

        message += japan;

        // Both the C and the C++ wide output paths now convert the wide string
        // to the locale's narrow encoding on output.
        // (Note: passing user data as the wprintf format string is risky in
        // general; wprintf(L"%ls", message.c_str()) is the safer form.)
        wprintf(message.c_str());
        wcout << message << endl;
    }

This works as expected (i.e. it converts the wide string to narrow UTF-8 and prints it).

When you set the global locale to "", you select the system locale; if that locale uses UTF-8, the output is printed as UTF-8, i.e. the wstring is converted on output.
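
To make the conversion step visible, here is a minimal sketch using the C API directly. It assumes a UTF-8 system locale (e.g. a typical Linux setup) and spells 日本 with hex escapes so it does not depend on the source encoding:

    #include <clocale>
    #include <cstdio>
    #include <cstdlib>
    #include <string>

    int main()
    {
        // C-level counterpart of std::locale::global(std::locale(""))
        std::setlocale(LC_ALL, "");

        std::wstring japan = L"\x65e5\x672c";   // 日本 as code points
        char buf[64];
        std::size_t n = std::wcstombs(buf, japan.c_str(), sizeof buf);
        if (n == (std::size_t)-1)
        {
            std::puts("conversion failed for this locale");
            return 1;
        }

        // In a UTF-8 locale this prints: e6 97 a5 e6 9c ac
        for (std::size_t i = 0; i < n; ++i)
            std::printf("%02x ", (unsigned char)buf[i]);
        std::printf("\n");
    }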

Edit: forget what I said about sync_with_stdio; that was not correct. The streams are synchronized by default, so it is not needed.

Artyom
You make it sound like `sync_with_stdio` and `wcout` are alternatives; they do completely different things. `sync_with_stdio` is required if you want to interleave C stream functions (like `wprintf`) with C++ stream usage (`wcout`); `imbue` is needed if you want to change the locale used by `wcout`.
Charles Bailey
I can't test it, but `wcout` should work without codepage settings on Windows because `wchar_t` is a UTF-16 code unit on Windows and UTF-16 is Windows's only native encoding. So `std::wcout` should use `WriteConsoleW` without any locale conversion. If it doesn't, it's a library bug.
Philipp
@Philipp That is not how this is defined by the standard. The standard says that wide characters should be converted to the narrow encoding according to the locale's codepage, and this is what is done. The issue with Windows is that it does not support UTF-8 as a locale encoding. So on Windows you probably need to use `locale::global(locale("Japan"))`, and it would use the Shift-JIS encoding for output. Otherwise it would fail to convert the characters.
Artyom
Microsoft's standard library `wcout` implementation uses the global C locale internally, so imbuing a locale is practically useless. You have to set the desired locale as the global locale...
smerlin
@Artyom: Thanks for the comment. This means that `std::wcout` is essentially useless on Windows. I'd consider this to be a mistake in the C++ standard that is unnecessarily biased towards Unix. BTW, Windows consoles do support UTF-8 (via `SetConsoleOutputCP`), but all code pages are obsolete and only kept for compatibility reasons. Shift-JIS is even more obsolete than UTF-8 because it's not a Unicode encoding. So it seems that one really has to call `WriteConsoleW` directly (a sketch follows at the end of this thread).
Philipp
Regarding my comment: this is only true for the ctype facet; imbuing a locale works for all other facets, AFAIK.
smerlin
@Artyom, @others: thanks, it helped me learn an annoying part of the language. It works fine now.
John D.
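
Following up on the `WriteConsoleW` point above, a minimal Windows-only sketch; it assumes output goes to a real console window (WriteConsoleW fails when stdout is redirected to a pipe or file):

    #include <windows.h>
    #include <string>

    int main()
    {
        std::wstring message = L"Welcome! Japan is \x65e5\x672c\n";

        // WriteConsoleW takes UTF-16 text directly, bypassing locales and
        // code pages entirely.
        HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD written = 0;
        WriteConsoleW(out, message.c_str(), (DWORD)message.size(), &written, NULL);
    }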