UPDATE: Thank you to @Potatoswatter and @Jonathan Leffler for the comments. Rather embarrassingly, I was caught out by the debugger tooltip not showing the value of a wstring correctly. However, it still isn't quite working for me, and I have updated the question below:

If I have a small multibyte file that I want to read into a string, I use the following trick: I call getline with a delimiter of '\0', e.g.

std::string contents_utf8;
std::ifstream inf1("utf8.txt");
getline(inf1, contents_utf8, '\0');

This reads in the entire file including newlines.
However, if I try to do the same thing with a wide-character file it doesn't work: my wstring only reads to the first line.

std::wstring contents_wide;
std::wifstream inf2(L"ucs2-be.txt");
getline( inf2, contents_wide, wchar_t(0) ); //doesn't work

For example, if my Unicode file contains the chars A and B separated by CRLF, the hex looks like this:

FE FF 00 41 00 0D 00 0A 00 42

Based on the fact that, with a multibyte file, getline with '\0' reads the entire file, I believed that getline( inf2, contents_wide, wchar_t(0) ) should read in the entire Unicode file. However, it doesn't: with the example above my wide string would contain the following two wchar_ts: FF FF

(If I remove the wchar_t(0), it reads in the first line as expected, i.e. FE FF 00 41 00 0D 00.)

Why doesn't wchar_t(0) work as a delimiting wchar_t so that getline stops on 00 00 (or reads to the end of the file which is what I want)?
Thank you

+1  A: 

Your UCS-2 decoder is misbehaving. The result of getline( inf2, contents_wide ) on FE FF 00 41 00 0D 00 0A 00 42 should be 0041 0000 = L"A". Assuming you're on Windows, the line ending should be properly converted, and the byte-order mark shouldn't appear in the output.

Suggest double-checking your OS documentation with respect to how you set the locale.

EDIT: Did you set the locale?

locale::global( locale( "something if your system supports UCS-2" ) );

or

locale::global( encoding_support::ucs2_bigendian_encoding );

where encoding_support is some library.

Potatoswatter
No I haven't set the locale - I will experiment with it (but I don't understand why that should change whether getline reads in 1 byte or 2 bytes when the parameters are wchar_ts). I don't understand what you mean when you say "encoding_support is some library" - all I want to do is read a file into wchar_ts?
hamishmcn
Potatoswatter, I am stuck: I don't know what to put as the parameter of the locale. I assume you want me to replace "something if your system supports UCS-2" with another parameter, but I don't know what. Should it be a language name? Surely the point of using wide chars is to avoid having to set code pages and the like? I am running WinXP SP3. Surely C++ can read wchar_ts from a file without having to specify a language?
hamishmcn
@hamish: I wish I knew what to tell you. I briefly looked around MSDN's documentation, but they are focused entirely on language internationalization and not on data encoding. Try broadening your search to include UTF-16 and third-party libraries… or consider ditching the standard library and reading the file yourself. Perhaps you can contact the developer of the software that produced the file you're trying to read and ask how they did it.
Potatoswatter
I appreciate all your comments and the time you put into this. For the record - this isn't a big problem for me (I am the creator of the file I am trying to read and I _can_ read it by reading into an array of char) it just bugs me that it doesn't work the way I expected it to - I wanted a quick and easy C++ way (with std::strings and ifstreams) to do it :-)
hamishmcn
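The "read it into an array of char" workaround mentioned above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the names read_all_bytes and decode_ucs2_be are made up, the input is assumed to be big-endian UCS-2 with an optional FE FF byte-order mark, and no surrogate handling is attempted.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read the raw bytes of a file, untranslated.
std::string read_all_bytes(const char* path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: combine each big-endian byte pair into one wchar_t.
// A leading FE FF byte-order mark is skipped; a trailing odd byte is ignored.
std::wstring decode_ucs2_be(const std::string& bytes) {
    std::size_t i = 0;
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFE &&
        static_cast<unsigned char>(bytes[1]) == 0xFF) {
        i = 2;  // skip the BOM
    }
    std::wstring out;
    for (; i + 1 < bytes.size(); i += 2) {
        unsigned hi = static_cast<unsigned char>(bytes[i]);
        unsigned lo = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<wchar_t>((hi << 8) | lo));
    }
    return out;
}
```

Calling decode_ucs2_be(read_all_bytes("ucs2-be.txt")) would then give the whole file as a wstring, newlines included, without involving any locale.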
A: 

L"ucs2-be.txt" looks to me like a flag for big endian, but the array FE FF 00 41 00 0D 00 0A 00 42 looks like little endian. I guess this is why the FE FF character was read into your array instead of being skipped over. I can't figure out why the presence or absence of wchar_t(0) affects the results though.

Windows programmer
L"ucs2-be.txt" is just the name of the file. FE FF is big-endian.
Potatoswatter
You're right, the filename is there to mislead human readers while having no effect on machines. But something's still wrong. FE FF is big endian but everything after it is little endian.
Windows programmer
Everything else is big-endian too. The zeroes come before, the significant part comes after.
Potatoswatter
You know what people do when they mess up as badly as I did here. They delete their answers :-)
Windows programmer
@Windows programmer - Thanks - the last comment made me laugh :-)
hamishmcn
+1  A: 

See this question: http://stackoverflow.com/questions/1509277/why-does-wide-file-stream-in-c-narrow-written-data-by-default/, where the poster is surprised by the wchar_t -> char conversion when writing.

The answers given to that question apply to the reading case also. In a nutshell: at the lowest level, file I/O is always done in terms of bytes. A basic_filebuf (what the fstream uses to actually perform the I/O) uses a codecvt facet to translate between the "internal" encoding (the char type seen by the program, and used to instantiate the stream, wchar_t in your case) and the "external" encoding of the file (which is always char).

The codecvt is obtained from the stream's locale. If no locale is imbue()-d on the stream, the global locale is used. By default, the global locale is the "classic" (or "C") locale. That locale's codecvt facet is pretty basic. I don't know what the standard says about it but, in my experience on Windows, it simply "casts" between char and wchar_t, one by one. On Linux, it does this too but fails if the character's value is outside the ASCII range.
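As a hedged illustration of imbue()-ing a different codecvt on the stream: the standard <codecvt> header (C++11, deprecated since C++17) provides std::codecvt_utf16, which decodes UTF-16 bytes into wide characters and, with std::consume_header, uses a leading byte-order mark to pick the byte order. The function name read_utf16_file is made up for this sketch, and behaviour may differ between standard library implementations, so treat it as something to experiment with rather than a guaranteed fix.

```cpp
#include <codecvt>   // std::codecvt_utf16 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: read a whole UTF-16 file into a wstring.
std::wstring read_utf16_file(const char* path) {
    std::wifstream in;
    // Imbue before opening so the facet is in place for the first read.
    // The locale takes ownership of the facet allocated with new.
    in.imbue(std::locale(
        in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // Binary mode keeps the runtime from translating CRLF at the byte level.
    in.open(path, std::ios::binary);

    std::wstring contents;
    std::getline(in, contents, L'\0');  // same "read to NUL or EOF" trick
    return contents;
}
```

With the ten example bytes from the question, this should produce the four wide characters A, CR, LF, B, since the BOM is consumed rather than returned.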

So, if you don't touch the locale (either by imbue()-ing one on the stream or changing the global one), what probably happens in your case is that chars are read from the file and cast to wchar_t one by one. It thus first reads FE, then FF, then 00, and getline(..., 0) stops right there.
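The effect described here can be imitated with a small simulation. This is not the library's actual implementation, only a sketch of the byte-by-byte widening under the assumption stated above; the function name widen_until is hypothetical.

```cpp
#include <string>

// Simulation only: widen each byte to a wchar_t, one by one, and stop at
// the first wide character equal to delim (as getline would).
std::wstring widen_until(const std::string& bytes, wchar_t delim) {
    std::wstring out;
    for (char c : bytes) {
        wchar_t w = static_cast<wchar_t>(static_cast<unsigned char>(c));
        if (w == delim) {
            break;  // the delimiter is consumed but not stored
        }
        out.push_back(w);
    }
    return out;
}
```

Fed the bytes FE FF 00 41 00 0D 00 0A 00 42 with a delimiter of wchar_t(0), this returns just two wide characters, 00FE and 00FF, because the first zero byte of the 00 41 pair is already treated as the delimiter.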

Éric Malenfant
This agrees with what I saw when I copied the code for getline and replaced the templated types with the types I am using, so that I could step through and figure out what was going on: the wifstream was reading one byte at a time into my wchar_t.
hamishmcn