I need to modify my program to accept Unicode, which may come from any of UTF-8 and the various UTF-16 and UTF-32 encodings. I don't really know much about Unicode (though I've read Joel Spolsky's article and the Wikipedia page).

Right now I'm using an std::istream and reading my input char by char, and then storing (when necessary) in an std::string. I'd like to

  • modify this (with as little effort as possible) to support the above encodings, and
  • figure out how to test the above encodings (I'm kinda white-bread American, and don't really know how to even make a sample text file in another encoding), and ideally
  • do this in a cross-platform way.

Also, if possible, I'd like to conserve space as much as possible (so if we don't need more than a byte/character, we don't use it). From what I understand, this means storing in UTF-8, which is fine, but I don't know of a standard string that does this (from what I understand, wchar_t has implementation-defined size and encoding).

+1  A: 

UTF-8 conserves space, as long as you are primarily using the standard ASCII characters.

std::string has no problem with UTF-8, because UTF-8 never produces a zero byte except for the NUL character (U+0000) itself. For encodings that do contain zero bytes, such as UTF-32, you can pass std::string an explicit length so it doesn't stop at the first one. What std::string can't tell you is how many characters a UTF-8 string contains: size() returns the number of bytes, so you have to use an external function for that.
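As a rough illustration of such an external function, here is a minimal sketch that counts code points in a UTF-8 std::string. The name utf8_code_points is made up for this example, and the code assumes the input is already valid UTF-8 (it does no validation). It also counts code points, not user-perceived characters, so a base letter followed by a combining accent counts as two.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 std::string.
// Assumes the input is valid UTF-8; malformed input is not detected.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        // Continuation bytes have the form 10xxxxxx.
        // Every other byte starts a new code point.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, the two-byte string "\xC3\xA9" (é in UTF-8) has size() == 2, while utf8_code_points reports 1.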

Also, there is a wide version of std::string that uses wchar_t instead of char; I just forget the name.

There are also facets in Boost for transforming between encodings.

You can either use the standard library together with Boost, or use the string-handling functions from the C library. There are also functions provided by programming frameworks such as Qt and Tcl.

See for example:

utf8 codecvt facet
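To give a feel for what a conversion to UTF-8 involves, here is a hand-rolled sketch that appends a single code point (for instance one decoded from UTF-32 input) to a std::string as UTF-8. This is not the Boost facet's API; append_utf8 is a name invented for this example, and it needs C++11 for char32_t.

```cpp
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
// Returns false (and appends nothing) for surrogates and values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Surrogate code points are not valid scalar values.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}
```

Calling append_utf8(s, 0x20AC) appends the three bytes E2 82 AC (the euro sign). Writing the resulting string to a file opened in binary mode is one way to produce a small UTF-8 sample for testing.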

The wide version of std::string is std::wstring
stukelly
Thanks! It didn't turn up after a quick web search and I didn't have access to my standard library reference.