Looking at the Unicode standard, it recommends using plain char for storing UTF-8 encoded strings. Does this work as expected with C++ and the basic std::string, or are there cases in which the UTF-8 encoding can create problems?

For example, when computing the length, it may not be identical to the number of bytes - how is this supposed to be handled? Reading the standard, I'm probably fine using a char array for storage, but I'll still need to write functions like strlen etc. on my own that work on encoded text, because as far as I understand the problem, the standard routines are either ASCII-only or expect wide literals (16 bit or more), which are not recommended by the Unicode standard. So far, the best source I found about the encoding stuff is a post on Joel on Software, but it does not explain what we poor C++ developers should use :)

A: 

From UTF-8 and Unicode FAQ: C support for Unicode:

#include <stdio.h>
#include <locale.h>

int main()
{
  if (!setlocale(LC_CTYPE, "")) {
    fprintf(stderr, "Can't set the specified locale! "
            "Check LANG, LC_CTYPE, LC_ALL.\n");
    return 1;
  }
  printf("%ls\n", L"Schöne Grüße");
  return 0;
}

Also from here:

The good news is that if you use wchar_t* strings and the family of functions related to them such as wprintf, wcslen, and wcslcat, you are dealing with Unicode values. In the C++ world, you can use std::wstring to provide a friendly interface. My only complaint is that these are 32-bit (4 byte) characters, so they are memory hogs for all languages. The reason for this choice is that it guarantees each possible character can be represented by one value.
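To illustrate the difference between byte count and character count (a hypothetical snippet, not from the FAQ; the 5-character string "Grüße" takes 7 bytes in UTF-8, and the wcslen result assumes a wchar_t wide enough to hold one character per unit, as on GNU systems):

#include <cstring>
#include <cwchar>
#include <iostream>

int main()
{
    // "Grüße" as a UTF-8 narrow string and as a wide string
    const char    *utf8 = "Gr\xC3\xBC\xC3\x9F" "e";
    const wchar_t *wide = L"Gr\u00FC\u00DFe";

    std::cout << std::strlen(utf8) << "\n"; // 7 - bytes, not characters
    std::cout << std::wcslen(wide) << "\n"; // 5 - one wchar_t per character
    return 0;
}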

PS. This is probably Linux-specific. There is the ICU library to handle complicated things.

jetxee
This does not work right when I try it on OS X with GCC 4.0.1: it prints the non-ASCII chars as escaped chars in octal code. When I write printf("%s\n", "Schöne Grüße"); instead, it prints correctly. Hence, this is not a solution to getting the number of UTF-8 characters in a string.
Thomas Tempelmann
I cannot tell for OS X, but this example definitely works with GCC 4.3.2 on GNU/Linux, *in a UTF-8 locale*. What is your locale on OS X? I suspect it is not a Unicode locale. Also, locales are probably handled differently on OS X; I don't know.
jetxee
Wrong on so many levels, I'm afraid: chars outside the guaranteed charset, assuming the console can print wchar_t's... And wchar_t is 2 bytes on most PCs.
MSalters
1) L"str" has a type of an array of wchar_t and is an example. 2) My console definitely can print wchar_t's and most of the Unicode; locale takes care of conversions. 3) For GNU systems wchar_t is always 32 bits wide unless special compiler flags are used. 4) I didn't say it is cross-platform.
jetxee
+3  A: 

There's a library called "UTF8-CPP", which lets you store your UTF-8 strings in standard std::string objects and provides additional functions to enumerate and manipulate UTF-8 characters.

I haven't tested it yet, so I don't know what it's worth, but I am considering using it myself.
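For instance, counting code points in a std::string with the library's utf8::distance might look like this (an untested sketch based on the library's documentation; the string literal is just an example):

#include <iostream>
#include <string>
#include "utf8.h" // UTF8-CPP header, assumed to be on the include path

int main()
{
    // "Grüße" stored as UTF-8 bytes in a plain std::string
    std::string text = "Gr\xC3\xBC\xC3\x9F" "e";
    std::cout << "bytes: " << text.size() << "\n";                  // 7
    std::cout << "code points: "
              << utf8::distance(text.begin(), text.end()) << "\n"; // 5
    return 0;
}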

Carl Seleborg
This is probably the way to go. There is also the ICU library, which does more or less the same thing.
jetxee
A: 

What we settled on: store UTF-8 in a std::string. You can do most operations then, except for things like computing the length. Use a UTF-8 -> std::wstring conversion function (boost::from_utf8, for example) to convert to a std::wstring when you need such operations.
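A rough sketch of what such a conversion function could do internally (for illustration only, not the Boost function itself; it assumes well-formed UTF-8 input and a wchar_t wide enough to hold any code point, as on GNU systems):

#include <stdexcept>
#include <string>

std::wstring from_utf8(const std::string &s)
{
    std::wstring out;
    for (std::string::size_type i = 0; i < s.size(); ) {
        unsigned char c = s[i];
        wchar_t cp;
        int extra; // number of continuation bytes that follow
        if      (c < 0x80)         { cp = c;        extra = 0; }
        else if ((c >> 5) == 0x06) { cp = c & 0x1F; extra = 1; }
        else if ((c >> 4) == 0x0E) { cp = c & 0x0F; extra = 2; }
        else if ((c >> 3) == 0x1E) { cp = c & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= s.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (int j = 1; j <= extra; ++j)
            cp = (cp << 6) | (s[i + j] & 0x3F); // append 6 payload bits
        out += cp;
        i += extra + 1;
    }
    return out;
}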

+1  A: 

It depends on what you want to do with the UTF-8 string. If all you are interested in is reading UTF-8 strings in and out, then it all works, as long as you have set the correct locale. We have done this for some time. We have several server processes that do nothing with strings as such. Their strings are set by the user in Java and arrive as UTF-8, and we handle them in standard C char buffers. We then send the data back to Java, which converts it back.

If you want the length in UTF-8 characters, then you need functions that can handle the translation for you.

But you can roll your own, for example a utf8-strlen.
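A minimal sketch of such a utf8-strlen (for illustration; it counts code points by skipping UTF-8 continuation bytes and assumes the input is valid UTF-8):

#include <cstddef>

// Count UTF-8 code points: every byte that is not a
// continuation byte (10xxxxxx) starts a new character.
std::size_t utf8_strlen(const char *s)
{
    std::size_t count = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    return count;
}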

David Allan Finch
+1  A: 

strlen counts the number of non-null chars before the first \0. In UTF-8, that count is a sane number (the number of bytes used), but it is not the number of characters (one UTF-8 character is typically 1-4 chars). basic_string doesn't store a \0, but it too keeps a byte count.

strcpy or the basic_string copy ctor copy all bytes without looking too closely.

Finding a substring works OK, because of the way UTF-8 is encoded: the allowed values for the first byte of a character are distinct from those of the second to fourth bytes (lead bytes never start with 10xxxxxx, continuation bytes always do).

Taking a substring is tricky - how do you specify the position? If the begin and end were found by searching for ASCII text markers (e.g. [ and ]), then there's no problem: you'd just get the bytes in the middle, which form a valid UTF-8 string too. But you can't hardcode positions, or even relative offsets. Even a relative offset of +1 character can be hard: how many bytes is that? You will end up writing a function like SkipOneChar.
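Such a SkipOneChar can be as simple as this (a sketch relying on the lead/continuation byte property described above; it assumes the pointer is at the lead byte of a well-formed character, not at the terminator):

// Advance past one UTF-8 encoded character.
const char *SkipOneChar(const char *p)
{
    ++p; // skip the lead byte
    while ((static_cast<unsigned char>(*p) & 0xC0) == 0x80)
        ++p; // skip continuation bytes (10xxxxxx)
    return p;
}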

MSalters
A: 

You should use std::wstring for all your internal manipulation of strings. This prevents the problems of having oddly-sized characters inside of your string. You then convert these characters to UTF-8 when communicating with the external world.

The document refers to storing strings in their encoded form, as char, which is what you want for the external representation, not for manipulation inside your program. The key point here is: don't use the UTF-8 encoding itself in the internals of your program; rather, use wide characters.
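The wide-to-UTF-8 direction for talking to the outside world can be sketched like this (again for illustration only; it assumes wchar_t holds full code points - with a 16-bit wchar_t, surrogate pairs would need extra handling):

#include <string>

std::string to_utf8(const std::wstring &w)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < w.size(); ++i) {
        unsigned long cp = w[i];
        if (cp < 0x80) {                       // 1 byte: plain ASCII
            out += char(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
    }
    return out;
}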

yuriks
I would rather say: if you need to manipulate strings, don't store them in variable-length encodings (UTF-8, UTF-16), but rather in UTF-32 (in wchar_t or another 32-bit type). Otherwise, reading and writing UTF-8 as byte streams is enough.
jetxee
Storing UTF-32 encoded strings rarely makes sense. It takes too much memory and does not solve many of the problems people usually think it solves. For instance, even with UTF-32 encoding you cannot assume that each 32-bit unit corresponds to a single character as perceived by users (combining characters, for example, join several code points into one perceived character).
Nemanja Trifunovic
@Nemanja Trifunovic: Why is that so? I was always assuming that UTF-16 was not able to store all single characters, but UTF-32 was.
mghie
@jetxee: Is "in wchar_t or other 32 bit type" really correct? I was always assuming that wchar_t could be either 16 bit or 32 bit, depending on the implementation.
mghie
+2  A: 

An example with the ICU library (C, C++, Java):

#include <iostream>
#include <unicode/unistr.h> // using ICU library

int main(int argc, char *argv[]) {
    // constructing a Unicode string
    UnicodeString ustr1("Привет"); // using platform's default codepage
    // calculating the length in characters, should be 6
    int ulen1 = ustr1.length();
    // extracting encoded characters from a string
    int const bufsize = 25;
    char encoded[bufsize];
    ustr1.extract(0, ulen1, encoded, bufsize, "UTF-8"); // forced UTF-8 encoding
    // printing the result
    std::cout << "Length of " << encoded << " is " << ulen1 << "\n";
    return 0;
}

Build it like this:

$ g++ -licuuc -o icu-example{,.cc}

Then run it:

$ ./icu-example
Length of Привет is 6

Works for me on Linux with GCC 4.3.2 and libicu 3.8.1. Please note that it prints in UTF-8 no matter what the system locale is. You won't see it correctly if yours is not UTF-8.

jetxee