What is the best type, in C++, for storing a UTF-8 string? I'd like to avoid rolling my own class if possible.

My original thought was std::string -- however, this uses char as the underlying type. char may be unsigned or signed - it varies. On my system, it's signed. UTF-8 code units, however, are unsigned octets. This seems to indicate that it's the wrong type.

This leads us to std::basic_string<unsigned char>, which seems to fit the bill: unsigned, 8-bit (or larger) chars.

However, most things seem to use char. glib, for example, uses char. C++'s ostreams use char.

Thoughts?

+5  A: 

I'd just use std::string, as it is consistent with the UTF-8 ideal of treating data just as you would null-terminated ASCII strings unless you actually need their unicode-ness.

I also like GTKmm's Glib::ustring, but that only works if you're writing a GTKmm (or at least Glibmm) application.

Michael E
The only real problem with using std::string is that operations expected to work on characters can end up dealing with partial characters, since UTF-8 is a multibyte encoding. For example, operator[] returns a single byte rather than a "character", and the string's length in characters is not directly accessible: size() reports the number of bytes, not the number of characters.
Evan Teran
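
The partial-character issue raised in this comment can be illustrated with a minimal sketch. The helper name utf8_length is hypothetical (it is not part of std::string or any library), and the code assumes the input is already well-formed UTF-8; it counts code points, which is not always the same thing as user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes well-formed UTF-8; no validation is performed.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte: start of a new code point
        }
    }
    return count;
}
```

For a string holding the bytes 0xC3 0xA9 (the UTF-8 encoding of "é") followed by "a", size() is 3 while utf8_length returns 2, and operator[] at index 0 yields only the first half of "é".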
+5  A: 

I've always just used std::string, myself -- somehow, the "signed" vs "unsigned" philosophical question just about never comes up as problematic in such a context (encoders and decoders to/from UTF-8 are things you only write rarely, after all; in an application context, you're just using the std::string as a "black box" of sorts!-).

Alex Martelli
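
For the rare places where the byte values do matter, a common approach is to keep std::string for storage and cast to unsigned char at the point of inspection. The sketch below assumes that approach; octet_at and is_lead_byte are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the code unit at index i as a value in
// 0..255, regardless of whether plain char is signed on this platform.
unsigned int octet_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// Hypothetical helper: true if the code unit at index i is the lead byte
// of a multi-byte sequence (0xC2..0xF4 in well-formed UTF-8).
bool is_lead_byte(const std::string& s, std::size_t i) {
    const unsigned int octet = octet_at(s, i);
    return octet >= 0xC2 && octet <= 0xF4;
}
```

Without the cast, a comparison such as s[i] >= 0xC2 is always false on a platform where char is signed and 8 bits wide, because every non-ASCII byte is read as a negative value.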
+2  A: 

UTF-8 is a variable-length character encoding. std::basic_string supports only fixed-length character encodings. If you need to support variable-length encodings, you could try the ICU4C library.

ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

If you just need to store a UTF-8 string, I'd recommend using std::vector<char>. That signals that you cannot perform actual string operations (which could be incorrect) on the stored data.

Kirill V. Lyadvinsky
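
A minimal sketch of the opaque-buffer idea, assuming the data only needs to be stored and handed back at API boundaries. The function names string_to_bytes and bytes_to_string are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: copies UTF-8 data into an opaque byte buffer that
// offers no string operations such as find or substr.
std::vector<char> string_to_bytes(const std::string& utf8) {
    return std::vector<char>(utf8.begin(), utf8.end());
}

// Hypothetical helper: converts the buffer back to std::string at the
// boundary where an API (an ostream, glib, ...) expects char data.
std::string bytes_to_string(const std::vector<char>& bytes) {
    return std::string(bytes.begin(), bytes.end());
}
```

The round trip preserves every byte, so the choice between the two containers is mainly about which operations the type invites, not about what it can hold.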