I'm struggling to get started with the C++ ICU library. I have tried to get the simplest example to work, but even that has failed. I would just like to output a UTF-8 string and then go from there.

Here is what I have:

#include <unicode/unistr.h>
#include <unicode/ustream.h>

#include <iostream>

int main()
{
    UnicodeString s = UNICODE_STRING_SIMPLE("привет");

    std::cout << s << std::endl;

    return 0;
}

Here is the output:

$ g++ -I/sw/include -licucore -Wall -Werror -o icu_test main.cpp 
$ ./icu_test 
пÑивеÑ

My terminal and font support UTF-8 and I regularly use the terminal with UTF-8. My source code is in UTF-8.

I think that perhaps I somehow need to set the output stream to UTF-8 because ICU stores strings as UTF-16, but I'm really not sure and I would have thought that the operators provided by ustream.h would do that anyway.

Any help would be appreciated, thank you.

+1  A: 

What happens if you write the output to a file (either by redirecting it from the terminal, or by opening a file stream in the program itself)?

That would determine whether or not it is the terminal that fails to handle the output correctly.

What happens if you inspect the output string in the debugger? Does it contain the correct values? Find out what the UTF-8 encoding of your string should look like, and compare it against what you get in the debugger. Or print out the integral value of each byte, and verify that those are correct.

When working with encodings it is always tricky (but essential) to determine whether the problem lies in your program itself or in the conversion that happens when the text is output to the system. Take the terminal out of the equation and verify that your program generates the correct output.
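
To check the integral value of each byte without relying on how the terminal renders them, something like the following could be used. It is only a minimal sketch: hex_bytes is a hypothetical helper name (not part of ICU), and it assumes the std::string already holds exactly the bytes the program would write out.

```cpp
#include <cstdio>
#include <string>

// Return the bytes of `s` as space-separated, two-digit lowercase hex values.
// Hypothetical helper for debugging; it does not interpret the encoding.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For a correctly encoded UTF-8 "привет", hex_bytes should give d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82. Anything else means the bytes were already wrong before they reached the terminal.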

jalf
Writing to a file is a very good step in debugging encodings.
Dr. Watson
I have just written to the file and I get the same output. I will have a look in the debugger just now.
Isaac
+3  A: 

Your program will work if you just change the initializer to:

UnicodeString s("привет");

The macro you were using is only for strings that contain "invariant characters", i.e., only Latin letters, digits, and some punctuation.

As was said before, input/output codepages are tricky. You said:

"My terminal and font support UTF-8 and I regularly use the terminal with UTF-8. My source code is in UTF-8."

That may be true, but ICU doesn't know that it's true. The process codepage might be different (say, ISO-8859-1), and the output codepage might be different again (say, Shift-JIS), in which case the program wouldn't work. A string made up only of invariant characters and created with UNICODE_STRING_SIMPLE would still work, though.

Hope this helps.

srl, icu dev

Steven R. Loomis
Thank you! That does indeed work. Since you signed off with 'icu dev', maybe you will know: do you know about any IRC channels for ICU help? I searched, but I couldn't find any.
Isaac
I don't know of any IRC channels - are we that popular? I sometimes watch here (and occasionally do other web searches), but our icu-support mailing list and bug database on http://icu-project.org are the main channels. An IRC channel is an interesting idea; you could propose it there. I'm the technical lead for ICU for C/C++.
Steven R. Loomis
Well, I've been doing quite a lot of searching over the past few days, looking for a Unicode solution and ICU is considered to be the 'best' for C++ from all of the sources I've read. All of the same sources also complain that the documentation is severely lacking and there are plenty of other forum posts saying the same thing. Given that I couldn't even get a 'hello world' style program to work, I would agree with this, sorry. I know it's not your fault, but if you have any influence, please make some suggestions about improving the docs.
Isaac
Maybe it would be helpful to know what documentation you tried, and which parts were helpful or unhelpful. Had you seen the documentation for UNICODE_STRING_SIMPLE, or did you find the macro somewhere else? It would work for UNICODE_STRING_SIMPLE("privet") but not for the string you tried. That's in the API docs, but that's not your fault either if they are hard to find or weren't helpful. "Improve the docs" is a bit broad as a task; filing a bug with specifics would help us. I'll try sitting on irc://irc.freenode.net/icu
Steven R. Loomis
A: 

operator<<(ostream, UnicodeString) converts between UTF-16 and chars by using ICU's "default converter". AFAIU, the "default converter" (if you don't set it explicitly with ucnv_setDefaultName()) depends on the platform and the way ICU was compiled. What do you get from ucnv_getDefaultName()?

Éric Malenfant
FWIW the standalone tool 'icuinfo' reports the default codepage as of 4.4. The default converter can come from many wild and wonderful places.
Steven R. Loomis
My problem has now been solved, but to answer your question, I get 'en_GB'.
Isaac
icuinfo should return something like: "Default locale: en_US … Default converter: UTF-8"
Steven R. Loomis