When I read a text file into a wide character string (std::wstring) using a wifstream, does the stream implementation support different encodings, i.e. can it be used to read e.g. ASCII, UTF-8, and UTF-16 files?

If not, what would I have to do?

(I need to read the entire file, if that makes a difference)

A: 

The design of wide character string and wide character stream pre-dates UTF-8, UTF-16 and Unicode. If you want to get technical, the standard string and the standard stream don't necessarily operate on ASCII (it's just that basically all computers out there use ASCII; you could potentially have an EBCDIC machine).

Raymond Chen once wrote a series illustrating how to work with different wide character stream/string types.

Max Lybbert
+2  A: 

ifstream does not care about the encoding of the file. It just reads chars (bytes) from the file. wifstream reads wide characters (wchar_t), but it still doesn't know anything about the file's encoding. wifstream is good enough for UCS-2, a fixed-length character encoding for Unicode in which each character is represented by two bytes.
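To make the "it just reads bytes" point concrete, here is a minimal sketch that pulls a whole file into a std::string without any decoding, plus a byte-order-mark check. The function names `read_all_bytes` and `sniff_bom` are made up for illustration, and the BOM check is only a heuristic: a file without a BOM tells you nothing this way.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Reads the entire file as raw bytes. Binary mode avoids newline
// translation; no character decoding is attempted.
std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Heuristic only: names the encoding suggested by a leading byte order
// mark. Plain ASCII and most UTF-8 files carry no BOM and yield "unknown".
// A UTF-32LE BOM (FF FE 00 00) would be misreported as UTF-16LE here.
std::string sniff_bom(const std::string& bytes) {
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return "UTF-8";
    if (bytes.compare(0, 2, "\xFF\xFE") == 0) return "UTF-16LE";
    if (bytes.compare(0, 2, "\xFE\xFF") == 0) return "UTF-16BE";
    return "unknown";
}
```

Calling `sniff_bom(read_all_bytes("data.txt"))` (the file name is a placeholder) gives a guess at best; deciding how to decode the bytes is still up to you or a library.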

You could use IBM's ICU library to deal with Unicode files.

The International Components for Unicode (ICU) is a mature, portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N), giving applications the same results on all platforms.

ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.

Kirill V. Lyadvinsky
I think it's slightly more correct to say that `ifstream` abstracts over the encoding. It makes use of it through lower-level facilities: locales (for standard C++), and OS or library specific i18n functions. i.e. `ifstream` may not care, but you do care what it calls in this case.
quark
Locales have nothing to do with Unicode encodings. When you set the locale, you just give `iostream` a hint about how it should represent symbols on the console. But you cannot detect the encoding of the *file*, and it is impossible to distinguish ANSI from UTF-8 by using `ifstream`.
Kirill V. Lyadvinsky
+7  A: 

C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets, which are classes (often templates parameterized on the character type) that define how localization-dependent objects (including I/O streams) behave. When you read from an istream or write to an ostream, the actual reading or writing of each character is filtered through the locale's facets. The facets cover not only the encoding of Unicode types but also such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
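As a small illustration of a stream being filtered through a facet, here is a sketch that uses a facet other than codecvt: a custom std::numpunct that groups digits with commas. The names `comma_grouping` and `format_grouped` are invented for this example; the same "build a locale with one replaced facet, then imbue it" pattern is what a codecvt facet relies on.

```cpp
#include <locale>
#include <sstream>
#include <string>

// A numpunct facet that separates groups of three digits with commas.
struct comma_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats an integer through a stream imbued with the facet above.
std::string format_grouped(long value) {
    std::ostringstream out;
    // The new locale copies the stream's current locale and replaces its
    // numpunct<char> facet; the locale takes ownership of the pointer.
    out.imbue(std::locale(out.getloc(), new comma_grouping));
    out << value;
    return out.str();
}
```

With this facet installed, `format_grouped(1234567)` yields "1,234,567", while a stream using the default "C" locale would write the digits without separators.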

However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should be reading into (let alone the encoding part) is difficult: wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers (e.g. Visual C++ and GNU C++) differ on how big their wchar_t is (typically 16 bits with Visual C++ and 32 bits with GCC on Linux). So you generally need to find external libraries to do the actual encoding.
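A quick way to see which situation your own compiler puts you in is to ask it directly. This is only a sketch; the helper names `wchar_bits` and `wchar_holds_any_code_point` are made up here, and 0x10FFFF is the largest Unicode code point.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on the current implementation.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if one wchar_t can hold every Unicode code point (up to U+10FFFF).
// When this is false, characters outside the 16-bit range need two
// wchar_t units (surrogate pairs) or get mangled.
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(WCHAR_MAX) >= 0x10FFFFULL;
}
```

Because both functions are constexpr, they can also be used in a `static_assert` to make a build fail early on a platform whose wchar_t is narrower than the code expects.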

  • iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
  • jla3ep mentions libICU, which is very thorough but the C++ API does not try to play nicely with the standard (As far as I can tell: you can scan the examples to see if you can do better.)

The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically tries to encode UTF-8 (UCS4) for use by IO streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):

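typedef wchar_t ucs4_t;  // assumes wchar_t is wide enough for UCS-4

std::locale old_locale;
// New locale: a copy of old_locale with the UTF-8 codecvt facet installed.
std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);

...

std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);  // imbue before the first read
ucs4_t item = 0;
while (input_file >> item) { ... }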
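To show roughly what a UTF-8 facet like the one above has to do on input, here is a hand-written sketch that turns UTF-8 bytes into UCS-4 code points. It is not Boost's implementation, the name `decode_utf8` is made up, and it is deliberately lenient: it does not reject overlong encodings or surrogate code points, which a production decoder should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 bytes into UCS-4 code points. Throws on an invalid lead
// byte, a truncated sequence, or a malformed continuation byte.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Using std::u32string rather than std::wstring sidesteps the question of how wide wchar_t is, at the cost of not plugging into wifstream directly.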

To understand more about locales, and how they use facets (including codecvt), take a look at the following:

quark
Nice summary. You might want to add http://www.amazon.com/dp/0201183951 to your book list. It's the most thorough treatment of the issue I know.
sbi
sbi: Added the book to the list. Thanks for the nice link.
quark