I'm asking for a code snippet that reads a Unicode string with cin, concatenates another Unicode string to it, and then writes the result with cout.

P.S. This code will help me solve another, bigger problem with Unicode. But first, the key thing is to accomplish what I'm asking here.

ADDED: BTW, I can't type any Unicode symbol on the command line when I run the executable file. How should I do that?

A: 

It depends on the OS. If your OS understands UTF-8, you can simply send it UTF-8 sequences.
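As a minimal sketch of that idea, assuming the terminal (or whatever feeds stdin and consumes stdout) really does speak UTF-8: plain `std::string` can carry the encoded bytes, and concatenation is just byte concatenation. The function name `concatUtf8` is hypothetical and only used for illustration.

```cpp
#include <iostream>
#include <sstream>  // only needed for the in-memory usage described below
#include <string>

// Reads one line of UTF-8 bytes from `in`, appends `suffix` (also UTF-8),
// and writes the combined bytes to `out`. No decoding takes place: the
// bytes are passed through unchanged.
void concatUtf8(std::istream& in, std::ostream& out, const std::string& suffix) {
    std::string line;
    if (!std::getline(in, line)) return;  // nothing to read
    out << line << suffix << '\n';
}
```

Calling `concatUtf8(std::cin, std::cout, suffix)` from `main` gives the console version; passing a `std::istringstream` and a `std::ostringstream` instead makes it easy to check the byte sequence without a terminal.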

Noah Roberts
He's on Windows, which uses UTF-16, but requires special API functions (`ReadConsole`/`WriteConsole`) to work with Unicode text.
Philipp
+3  A: 

It depends on what kind of Unicode you mean. I assume you are just working with std::wstring, though. In that case, use std::wcin and std::wcout.

For conversion between encodings, you can use your OS functions (for Win32: WideCharToMultiByte and MultiByteToWideChar), or you can use a library such as libiconv.
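To show what such a conversion does, here is a hand-rolled, portable sketch that encodes UTF-32 code points as UTF-8. It is not the Win32 or libiconv API, just a stand-in for illustration; the names `appendUtf8` and `toUtf8` are hypothetical, and the code assumes every input value is a valid Unicode scalar value (no validation is done).

```cpp
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
// Assumption: `cp` is a valid Unicode scalar value (<= 0x10FFFF, no surrogate).
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a whole UTF-32 string to UTF-8.
std::string toUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) appendUtf8(out, cp);
    return out;
}
```

In real code the OS functions or a library are the better choice, since they also handle invalid input and other encodings.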

Brian R. Bondy
At which point you can use UTF-16 instead of UTF-8 iff your OS understands it.
Noah Roberts
+1: wcout for wstring for wchar_t (primarily Windows's UTF-16), cout for string for char (Linux, UTF-8 by default)
rubenvb
`wcin` and `wcout` don't work on Windows.
Philipp
@Philipp: In what way do `wcin` and `wcout` not work for you? They won't display Unicode characters not supported by your console font, but that's a fault of the console and not iostreams.
Ben Voigt
@Ben Voigt: They don't display Unicode characters at all, even if the font supports them. See my answer for an example. The reason is that they don't wrap `ReadConsoleW`/`WriteConsoleW`.
Philipp
A: 

If you have actual text (i.e., a string of logical characters), then insert to the wide streams instead. The wide streams will automatically encode your characters to match the bits expected by the locale encoding. (And if you have encoded bits instead, the streams will decode the bits, then re-encode them to match the locale.)

There is a lesser solution: if you KNOW you have UTF-encoded bits (i.e., an array of bits intended to be decoded into a string of logical characters) AND you KNOW the target of the output stream expects that very same bit format, you can skip the decoding and re-encoding steps and write() the bits as-is. This only works when both sides use the same encoding, which may be the case for small utilities not intended to communicate with processes in other locales.
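A minimal sketch of that write()-as-is approach, assuming the bytes are already UTF-8 and the receiving side expects UTF-8 too. The helper name `writeRaw` is hypothetical.

```cpp
#include <iostream>
#include <sstream>  // only needed for the in-memory usage described below
#include <string>

// Writes already-encoded bytes to `out` unchanged, with an explicit byte
// count. No decoding or re-encoding takes place, so this is only correct
// when the consumer of `out` expects exactly this encoding.
void writeRaw(std::ostream& out, const std::string& encodedBytes) {
    out.write(encodedBytes.data(),
              static_cast<std::streamsize>(encodedBytes.size()));
}
```

With `std::cout` as the target this sends the bytes straight to stdout; with a `std::ostringstream` you can confirm that the output is byte-for-byte identical to the input.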

John
There is no local encoding on Windows and thus the wide streams don't work.
Philipp
+1  A: 

I had a similar problem in the past; in my case, imbue and sync_with_stdio did the trick. Try this:

#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main() {
        ios_base::sync_with_stdio(false);
        wcin.imbue(locale("en_US.UTF-8"));
        wcout.imbue(locale("en_US.UTF-8"));

        wstring s;
        wstring t(L" la Polynésie française");

        wcin >> s;
        wcout << s << t << endl;
        return 0;
}
Bolo
Did you test this code? I get a runtime error!
Narek
I have debugged it; it seems this line is the problem: wcin.imbue(locale("en_US.UTF-8"));
Narek
@Narek Yes, I did test the code. It runs without problems on my Ubuntu. What system do you have?
Bolo
Windows Vista :(
Narek
`wcin` and `wcout` don't work on Windows, just like the equivalent C functions. Only the native API works.
Philipp
A: 

Here is an example that shows four different methods, of which only the third (C conio) and the fourth (native Windows API) work (and only if stdin/stdout aren't redirected). Note that you still need a font that contains the characters you want to show (Lucida Console supports at least Greek and Cyrillic). Everything here is completely non-portable; there is simply no portable way to input/output Unicode strings on the terminal.

#ifndef UNICODE
#define UNICODE
#endif

#ifndef _UNICODE
#define _UNICODE
#endif

#define STRICT
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN

#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>

#include <conio.h>
#include <windows.h>

void testIostream();
void testStdio();
void testConio();
void testWindows();

int wmain() {
    testIostream();
    testStdio();
    testConio();
    testWindows();
    std::system("pause");
}

void testIostream() {
    std::wstring first, second;
    std::getline(std::wcin, first);
    if (!std::wcin.good()) return;
    std::getline(std::wcin, second);
    if (!std::wcin.good()) return;
    std::wcout << first << second << std::endl;
}

void testStdio() {
    wchar_t buffer[0x1000];
    if (!_getws_s(buffer)) return;
    const std::wstring first = buffer;
    if (!_getws_s(buffer)) return;
    const std::wstring second = buffer;
    const std::wstring result = first + second;
    _putws(result.c_str());
}

void testConio() {
    wchar_t buffer[0x1000];
    std::size_t numRead = 0;
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring first(buffer, numRead);
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second + L'\n';
    _cputws(result.c_str());
}

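void testWindows() {
    const HANDLE stdIn = GetStdHandle(STD_INPUT_HANDLE);
    WCHAR buffer[0x1000];
    // ReadConsoleW expects the buffer size in characters, not in bytes.
    const DWORD capacity = sizeof buffer / sizeof buffer[0];
    DWORD numRead = 0;
    if (!ReadConsoleW(stdIn, buffer, capacity, &numRead, NULL)) return;
    // Drop the trailing CR LF of the first line; bail out on shorter reads.
    if (numRead < 2) return;
    const std::wstring first(buffer, numRead - 2);
    if (!ReadConsoleW(stdIn, buffer, capacity, &numRead, NULL)) return;
    // Keep the CR LF of the second line so the output ends with a line break.
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second;
    const HANDLE stdOut = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD numWritten = 0;
    WriteConsoleW(stdOut, result.c_str(), static_cast<DWORD>(result.size()), &numWritten, NULL);
}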
  • Edit 1: I've added a method based on conio.
  • Edit 2: I've messed around with _O_U16TEXT a bit as described in Michael Kaplan's blog, but that seemingly only had wgets interpret the (8-bit) data from ReadFile as UTF-16. I'll investigate this a bit further during the weekend.
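The `numRead - 2` in `testWindows` relies on the console input ending in CR LF. A slightly more defensive way to drop the line ending, sketched here as a portable helper (the name `stripLineEnding` is hypothetical and not part of any Windows API):

```cpp
#include <string>

// Removes a trailing L'\n' if present, then a trailing L'\r' if present,
// so "text\r\n", "text\n", and "text\r" all become "text".
// Input without a line ending is returned unchanged.
std::wstring stripLineEnding(std::wstring text) {
    if (!text.empty() && text.back() == L'\n') text.pop_back();
    if (!text.empty() && text.back() == L'\r') text.pop_back();
    return text;
}
```

Constructing the string as `std::wstring(buffer, numRead)` and passing it through such a helper avoids the unsigned underflow that `numRead - 2` would cause if fewer than two characters were read.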
Philipp
Thanks. Could you also tell me how to type Unicode on the command line? I can't! It ignores my input and writes Latin characters instead.
Narek
Also you might want to write "main" instead of "wmain", no?
Narek
If you want to read command line arguments, declare `wmain` as `int wmain(int argc, wchar_t** argv)` (the `w` is not a typo!).
Philipp
No, anyway, I can't write any damn letter from the Armenian or Russian alphabet on the command line!
Narek
What did you try? BTW, I think you'd better ask a new question; the comments aren't a good substitute for a discussion forum.
Philipp