Assuming a very simple program that:

  • asks for a name.
  • stores the name in a variable.
  • displays the variable's content on the screen.

It's so simple that it is the first thing one learns.

My problem is that I don't know how to do the same thing when the name is entered using Japanese characters.

So, if you know how to do this in C++, please show me an example (that I can compile and test)

Thanks.


user362981 : Thanks for your help. I compiled the code that you wrote without problems, then the console window appears and I cannot enter any Japanese characters in it (using the IME). Also, if I change a word in your code ("hello") to one that contains Japanese characters, those are not displayed either.

Svisstack : Also thanks for your help. But when I compile your code I get the following error:

warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'
A: 

Try replacing cout with wcout, cin with wcin, and string with wstring. Depending on your platform, this may work:

#include <iostream>
#include <string>

int main() {
  std::wstring name;
  std::wcout << L"Enter your name: "; 
  std::wcin >> name;
  std::wcout << L"Hello, " << name << std::endl;
}

There are other ways, but this is sort of the "minimal change" answer.

EvanED
Actually, I think you still have to create a locale with a ctype facet that matches the encoding the console uses, and then call `std::wcout.imbue` and `std::wcin.imbue` (and, as far as I know, with Microsoft's buggy STL implementation, `std::locale::global` as well) before using the wide streams.
smerlin
A: 
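#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    wchar_t name[256];

    /* Use the user's locale so wide I/O is converted to and from the console encoding. */
    setlocale(LC_ALL, "");

    wprintf(L"Type a name: ");
    /* %ls reads into a wchar_t buffer; 255 leaves room for the terminating null. */
    if (wscanf(L"%255ls", name) != 1)
        return 1;

    wprintf(L"Typed name is: %ls\n", name);

    return 0;
}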
Svisstack
You want wscanf and wprintf, not the string-reading and string-writing equivalents.
Owen S.
@Owen: Yes i missed it, thanks
Svisstack
+1  A: 

You can do simple things with the generic wide character support in your OS of choice, but generally C++ doesn't have good built-in support for Unicode, so you'll be better off in the long run looking into something like ICU.

Nick Bastin
+5  A: 

You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t, do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:

With the wchar_t wide character type, ANSI/ISO C provides for inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.

and that

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.

So, it's implementation-defined. Here are two implementations: on Linux, wchar_t is 4 bytes wide and represents text in the UTF-32 encoding (regardless of the current locale), either BE or LE, whichever is native to your system. Windows, however, has a 2-byte-wide wchar_t and uses it to represent UTF-16 code units. Completely different.

A better path: learn about locales, as you'll need to know about them. For example, because I have my environment set up to use UTF-8 (Unicode), the following program will use Unicode:

#include <clocale>
#include <iostream>
#include <string>

int main()
{
    std::setlocale(LC_ALL, "");
    std::cout << "What's your name? ";
    std::string name;
    std::getline(std::cin, name);
    std::cout << "Hello there, " << name << "." << std::endl;
    return 0;
}

...

$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8

But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2". Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter: the program will still perform correctly.

Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)

On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8). Likewise, if you output a UTF-8 "è" (0xC3 0xA8), I'll see (ISO-8859-2) "Ă¨" (0xC3 0xA8). This barfing of incorrect characters has been called Mojibake.

Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)

Sadly, this is about the state of Unicode support, in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character-sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.

In the end, it's not all that complicated really: know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin/std::cin and stdout/std::cout were sometimes in UTF-8, sometimes in ISO-8859-2.

Thanatos
A UTF-8 "è" is `0xC3 0xA8`, not `0xE8`. You probably meant ISO-8859-1.
dan04
@dan04: Excellent catch, thank you! `0xE8` is the Unicode code point (but, like you said, not the UTF-8 encoding) for "è". I've updated my example.
Thanatos