ansaurus

Question

problem using getline with a unicode file

Answer 1

+1 A:

Your UCS-2 decoder is misbehaving. The result of getline( inf2, contents_wide ) on FE FF 00 41 00 0D 00 0A 00 42 should be 0041 0000 = L"A". Assuming you're on Windows, the line ending should be properly converted, and the byte-order mark shouldn't appear in the output.

I suggest double-checking your OS documentation on how to set the locale.

EDIT: Did you set the locale?

locale::global( locale( "something if your system supports UCS-2" ) );

or

locale::global( encoding_support::ucs2_bigendian_encoding );

where encoding_support is some library.

Potatoswatter 2010-04-28 00:11:34
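One facet that could stand in for the "encoding_support" placeholder above is std::codecvt_utf16 from <codecvt>, which was standardized in C++11 (after this thread) and deprecated in C++17. The following is only a sketch under those assumptions: the helper names and the scratch file name are hypothetical, and the locale is imbued on the stream rather than installed globally. On platforms with a 16-bit wchar_t the facet should behave as a UCS-2 decoder.

```cpp
#include <cassert>
#include <codecvt>   // std::codecvt_utf16 (C++11, deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper (not from the thread): return the first line of a
// UTF-16 file whose byte order is announced by a leading byte-order mark.
std::wstring first_line_utf16(const char* path) {
    std::wifstream in;
    // Install the facet before opening; the locale owns the facet pointer.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // Binary mode keeps the runtime from translating 0D 0A at the byte level.
    in.open(path, std::ios::binary);
    std::wstring line;
    std::getline(in, line);
    // In binary mode the carriage return survives, so drop it by hand.
    if (!line.empty() && line.back() == L'\r') line.pop_back();
    return line;
}

// Writes the ten bytes quoted in the answer to a scratch file, reads them back.
void demo_first_line_utf16() {
    const char* path = "ucs2-be-demo.txt";  // assumed scratch file name
    const unsigned char bytes[] = {0xFE, 0xFF, 0x00, 0x41, 0x00, 0x0D,
                                   0x00, 0x0A, 0x00, 0x42};
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
    out.close();
    assert(first_line_utf16(path) == L"A");
}
```

Calling demo_first_line_utf16() exercises the helper on the byte sequence from the answer. The consume_header mode lets the facet skip the FE FF mark and pick the byte order from it, so the mark does not appear in the returned string.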

No I haven't set the locale - I will experiment with it (but I don't understand why that should change whether getline reads in 1 byte or 2 bytes when the parameters are wchar_ts). I don't understand what you mean when you say "encoding_support is some library" - all I want to do is read a file into wchar_ts?

hamishmcn 2010-04-28 00:47:05

@Potatoswatter: I am stuck. I don't know what to pass to the locale constructor. I assume you want me to replace "something if your system supports UCS-2" with another argument, but I don't know which one. Should it be a language name? Surely the point of using wide chars is to avoid having to set code pages and the like? I am running WinXP SP3. Surely C++ can read wchar_ts from a file without having to specify a language?

hamishmcn 2010-04-28 02:40:02

@hamish: I wish I knew what to tell you. I briefly looked around MSDN's documentation, but they are focused entirely on language internationalization and not on data encoding. Try broadening your search to include UTF-16 and third-party libraries… or consider ditching the standard library and reading the file yourself. Perhaps you can contact the developer of the software that produced the file you're trying to read and ask how they did it.

Potatoswatter 2010-04-28 07:27:50

I appreciate all your comments and the time you put into this. For the record - this isn't a big problem for me (I am the creator of the file I am trying to read and I _can_ read it by reading into an array of char) it just bugs me that it doesn't work the way I expected it to - I wanted a quick and easy C++ way (with std::strings and ifstreams) to do it :-)

hamishmcn 2010-04-28 08:11:14
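The manual route mentioned in the comments above (reading the file into an array of char and assembling the 16-bit units yourself) might look roughly like the sketch below. The function names are hypothetical, the input is assumed to be big-endian UCS-2 with an optional FE FF mark, and a trailing odd byte is silently ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): combine big-endian byte pairs
// into 16-bit code units, skipping a leading FE FF byte-order mark.
std::u16string decode_ucs2_be(const std::vector<unsigned char>& bytes) {
    std::u16string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
        i = 2;  // the mark is metadata, not text
    }
    for (; i + 1 < bytes.size(); i += 2) {
        // High byte comes first in big-endian order.
        out.push_back(static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1]));
    }
    return out;
}

// Decodes the ten bytes quoted in Answer 1.
void demo_decode_ucs2_be() {
    const std::vector<unsigned char> file = {0xFE, 0xFF, 0x00, 0x41, 0x00,
                                             0x0D, 0x00, 0x0A, 0x00, 0x42};
    const std::u16string text = decode_ucs2_be(file);
    assert(text == u"A\r\nB");
}
```

The bytes themselves can come from a std::ifstream opened with std::ios::binary. Line-ending handling is left to the caller here, since nothing in this path translates 0D 0A.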

Answer 2

A:

L"ucs2-be.txt" looks to me like a flag for big endian, but the array FE FF 00 41 00 0D 00 0A 00 42 looks like little endian. I guess this is why the FE FF character was read into your array instead of being skipped over. I can't figure out why the presence or absence of wchar(0) affects the results though.

Windows programmer 2010-04-28 00:12:48

L"ucs2-be.txt" is just the name of the file. FE FF is big-endian.

Potatoswatter 2010-04-28 00:14:03

You're right, the filename is there to mislead human readers while having no effect on machines. But something's still wrong. FE FF is big endian but everything after it is little endian.

Windows programmer 2010-04-28 00:22:29

Everything else is big-endian too. The zeroes come before, the significant part comes after.

Potatoswatter 2010-04-28 00:34:09

You know what people do when they mess up as badly as I did here. They delete their answers :-)

Windows programmer 2010-04-28 03:58:02

@Windows programmer - Thanks - the last comment made me laugh :-)

hamishmcn 2010-04-28 08:13:04

Answer 3

+1 A:

See this question: http://stackoverflow.com/questions/1509277/why-does-wide-file-stream-in-c-narrow-written-data-by-default/, where the poster is surprised by the wchar_t -> char conversion when writing.

The answers given to that question apply to the reading case also. In a nutshell: at the lowest level, file I/O is always done in terms of bytes. A basic_filebuf (what the fstream uses to actually perform the I/O) uses a codecvt facet to translate between the "internal" encoding (the char type seen by the program, and used to instantiate the stream, wchar_t in your case) and the "external" encoding of the file (which is always char).

The codecvt is obtained from the stream's locale. If no locale is imbue()-d on the stream, the global locale is used. By default, the global locale is the "classic" (or "C") locale. That locale's codecvt facet is pretty basic. I don't know what the standard says about it but, in my experience on Windows, it simply "casts" between char and wchar_t, one by one. On Linux, it does this too but fails if the character's value is outside the ASCII range.

So, if you don't touch the locale (either by imbue()-ing one on the stream or changing the global one), what probably happens in your case is that chars are read from the file and cast to wchar_t one by one. It thus first reads FE, then FF, then 00, and getline(..., 0) stops right there.

Éric Malenfant 2010-04-28 15:18:13
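The byte-by-byte widening described in this answer can be imitated without any stream at all. The sketch below is an illustration only, not the library's implementation: widen_bytes_until is a hypothetical function that casts each byte to one wchar_t and stops at the delimiter, the way getline(..., 0) would stop at the first zero.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustration only: mimic a facet that turns every byte into one wchar_t,
// collecting characters until the delimiter is met.
std::wstring widen_bytes_until(const std::vector<unsigned char>& bytes,
                               wchar_t delim) {
    std::wstring out;
    for (unsigned char b : bytes) {
        const wchar_t wc = static_cast<wchar_t>(b);  // one byte, one wchar_t
        if (wc == delim) break;                      // delimiter is not stored
        out.push_back(wc);
    }
    return out;
}

// Feeds in the ten bytes of the file discussed in this thread.
void demo_widen_bytes() {
    const std::vector<unsigned char> file = {0xFE, 0xFF, 0x00, 0x41, 0x00,
                                             0x0D, 0x00, 0x0A, 0x00, 0x42};
    const std::wstring got = widen_bytes_until(file, L'\0');
    assert(got.size() == 2);     // only the two halves of the mark
    assert(got[0] == 0x00FE);
    assert(got[1] == 0x00FF);
}
```

The result holds 00FE and 00FF as two separate wide characters, and the zero byte that is really the high half of 0041 ends the read.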

This agrees with what I saw. I copied the code for getline and replaced the templated types with the types I am using, so that I could step through it and figure out what was going on: the wifstream was reading one byte at a time into my wchar_t.

hamishmcn 2010-04-28 21:58:59
