How to read/store unicode with STL strings and streams

+4 Q:

I need to modify my program to accept Unicode, which may come from any of UTF-8 and the various UTF-16 and UTF-32 encodings. I don't really know much about Unicode (though I've read Joel Spolsky's article and the Wikipedia page).

Right now I'm using an std::istream and reading my input char by char, and then storing (when necessary) in an std::string. I'd like to:

1. modify this (with as little effort as possible) to support the above encodings,
2. figure out how to test the above encodings (I'm kinda white-bread American, and don't really know how to even make a sample text file in another encoding), and
3. ideally, do this in a cross-platform way.

Also, if possible, I'd like to conserve space as much as possible (so if we don't need more than a byte/character, we don't use it). From what I understand, this means storing in UTF-8, which is fine, but I don't know of a standard string that does this (from what I understand, wchar_t has implementation-defined size and encoding).

+1 A:

UTF-8 conserves space as long as the text is primarily ASCII: each ASCII character takes one byte, while every other character takes two to four bytes.

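std::string has no problem with UTF-8, since UTF-8 text contains no zero bytes (other than an actual NUL character). If the data does contain zero bytes, as UTF-32 does, you can still store it by telling std::string the length explicitly when you construct it. What std::string cannot tell you is how many characters a UTF-8 string contains: size() returns the number of bytes, so you would have to use an external function to count characters.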
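To make the byte-versus-character distinction concrete, here is a minimal sketch of such an external function. The names utf8_length and from_bytes are made up for illustration, and the counter assumes the input is already well-formed UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 std::string.
// Assumes well-formed UTF-8; malformed input is not detected.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx. Every other byte
        // starts a new code point, so only those are counted.
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: store raw bytes that may contain zero bytes
// (for example UTF-32 data) by passing the length explicitly instead
// of relying on a terminating '\0'.
std::string from_bytes(const char* data, std::size_t n) {
    return std::string(data, n);
}
```

For the UTF-8 bytes of "naïve" (6 bytes, because ï takes two), size() reports 6 while utf8_length reports 5. Note that this counts code points, which is not always the same as what a user perceives as one character (combining marks count separately).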

Also, there is a wide version of std::string that uses wchar_t instead of char; I just forget the name.

There are also facets in Boost for converting between encodings.

You can either use the standard library together with Boost, or use the string handling functions from the C library. Programming frameworks such as Qt and Tcl also provide their own functions.

See for example:

utf8 codecvt facet
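Before any conversion can happen, the program has to know which encoding the input is in. One common approach, sketched below, is to look for a byte-order mark (BOM) at the start of the data. The Encoding enum and detect_bom function are hypothetical names, not part of Boost or the standard library, and many UTF-8 files carry no BOM at all, so a fallback is still needed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical set of encodings this sketch can recognise.
enum class Encoding { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, Unknown };

// Guess the encoding from a byte-order mark at the start of the data.
// Returns Unknown when no BOM is present; the caller then has to fall
// back on a default (often UTF-8) or on out-of-band information.
Encoding detect_bom(const std::string& bytes) {
    const std::size_t n = bytes.size();
    auto b = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    // UTF-32 must be checked before UTF-16, because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if (n >= 4 && b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00) {
        return Encoding::Utf32LE;
    }
    if (n >= 4 && b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF) {
        return Encoding::Utf32BE;
    }
    if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) {
        return Encoding::Utf8;
    }
    if (n >= 2 && b(0) == 0xFF && b(1) == 0xFE) {
        return Encoding::Utf16LE;
    }
    if (n >= 2 && b(0) == 0xFE && b(1) == 0xFF) {
        return Encoding::Utf16BE;
    }
    return Encoding::Unknown;
}
```

One known ambiguity: UTF-16LE text whose first character after the BOM is U+0000 looks the same as a UTF-32LE BOM. That is rare in real text files, but it means BOM sniffing is a heuristic rather than a guarantee.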

2008-12-24 07:51:17

The wide version of std::string is std::wstring

stukelly 2008-12-24 08:59:27

Thanks! It didn't turn up after a quick web search and I didn't have access to my standard library reference.

2008-12-24 09:14:01

+2 A:

Have a look at the question "Switching from std::string to std::wstring for embedded applications?"

As Pukku said: you might get some headaches because the C++ standard requires wide streams to convert double-byte characters to single-byte ones when writing to a file, and how this conversion is done is implementation-dependent.
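One way to sidestep the implementation-dependent conversion is to do the encoding yourself and write plain bytes through a narrow stream. The following is only a sketch under that assumption: it keeps text in memory as std::u32string (one element per code point, C++11 or later) and encodes it to UTF-8 on output. The function names append_utf8, to_utf8 and write_utf8 are invented for this example.

```cpp
#include <ostream>
#include <string>

// Hypothetical helper: append one code point to out as UTF-8.
// Surrogates and values above U+10FFFF are not valid code points for
// encoding, so they are replaced with U+FFFD (replacement character).
void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        cp = 0xFFFD;
    }
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole sequence of code points as a UTF-8 byte string.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        append_utf8(out, cp);
    }
    return out;
}

// Write the text as UTF-8 bytes to a narrow stream. Opening the file
// with std::ios::binary avoids newline translation on some platforms.
void write_utf8(std::ostream& os, const std::u32string& text) {
    const std::string bytes = to_utf8(text);
    os.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```

Because the output is produced byte by byte in your own code, the result does not depend on the stream's locale or on the size of wchar_t on a given platform.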

stukelly 2008-12-24 09:03:07
