+1  A: 

Take a look at How do I print UTF-8 from c++ console application on Windows

Xinus
Good idea, but Japanese is a whole 'nother game. Besides, entering "chcp 65001" in the command window didn't affect the debug window...
MPelletier
+1  A: 

I tried this code:

#include <iostream>
#include <fstream>
#include <sstream>

int main()
{
    std::wstringstream wss;
    wss << L"àéêù";
    std::wstring s = wss.str();
    const wchar_t* p = s.c_str();
    std::wcout << wss.str() << std::endl;

    std::wofstream file("C:\\a.txt");
    file << p << std::endl;

    return 0;
}

The debugger showed that wss, s and p all had the expected values (i.e. "àéêù"), as did the output file. However, what appeared in the console was óúÛ¨.

The problem is therefore in the Visual Studio console, not the C++. Using Bahbar's excellent answer, I added:

    SetConsoleOutputCP(1252);

as the first line, and the console output then appeared as it should.

Charles Anderson
+5  A: 

Before I go any further, I should mention that what you are doing is not C/C++ compliant. The standard specifies, in 2.2, which character sets are valid in source code. There isn't much in there, and all the characters used are ASCII. So... everything below is about a specific implementation (as it happens, VC2008 on a US-locale machine).

To start with, you have 4 chars on your cout line, and 4 glyphs in the output. So the issue is not one of UTF-8 encoding, as that would combine multiple source bytes into fewer glyphs.

From your source string to the display on the console, all of these play a part:

  1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
  2. What your compiler does with a string literal, and what source encoding it understands
  3. How your << interprets the encoded string you're passing in
  4. What encoding the console expects
  5. How the console translates that output to a font glyph

Now...

1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in, and decodes it to its internal representation. It then generates the data chunk for the string literal in the current codepage, no matter what the source encoding was. I have failed to find explicit details/control on this.

3 is even easier. Except for control codes, << just passes the data through for a char *.

4 is controlled by SetConsoleOutputCP. It should default to your system's default codepage. You can also find out which one you have with GetConsoleOutputCP. (The input is controlled separately, through SetConsoleCP.)

5 is a funny one. I banged my head trying to figure out why I could not get the é to show up properly, using CP1252 (Western European, Windows). It turns out that my system font does not have the glyph for that character, and helpfully substitutes the glyph from my standard codepage (a capital Theta, the same one I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a TrueType font).

Some interesting things I learned looking at this:

  • the encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF-8 did not change the generated code. My "é" string was still encoded with CP1252 as 233 0)
  • VC is picking a codepage for the string literals that I do not seem to control.
  • controlling what the console shows is more painful than I expected

So... what does this mean for you? Here are some bits of advice:

  • don't use non-ASCII characters in string literals. Use resources, where you control the encoding.
  • make sure you know what encoding your console expects, and that your font has the glyphs to represent the characters you send.
  • if you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. char * a = "é"; std::cout << (unsigned int)(unsigned char)a[0]; shows 233 for me, which happens to be the encoding in CP1252.

BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.

Bahbar
Using resources... definitely gotta look into that. Here's where it gets tougher, though: the console acts as a filter of sorts, because if I "cin >>" some accented letters, lo and behold, funny characters come out on the other side! I'm not at that machine at the moment, but I will try to re-output what I get from cin and see if it gets garbled further or reverts back.
MPelletier
Excellent answer. I shall certainly make a note of this.
Charles Anderson
A: 

This kind of error happens in the rich textbox1 too; I think there is a bug in Visual Studio 2010. Nice work!

thim
My understanding is not that the bug is in Visual Studio per se, but that non-ASCII characters (even just extended ASCII) are not proper to use *in code*. I have moved on from that project and never looked into using resources, but obviously characters outside ASCII (even accented Latin letters) need more support and can't be printed directly.
MPelletier