tags:

views:

97

answers:

1

What is the correct way to read Unicode files line by line in C++?

I am trying to read a file saved as Unicode (LE) by Windows Notepad.

Suppose the file contains simply the characters A and B on separate lines.

Reading the file byte by byte, I see the following byte sequence (hex):

FF FE 41 00 0D 00 0A 00 42 00 0D 00 0A 00

So: a 2-byte BOM, a 2-byte 'A', 2-byte CR, 2-byte LF, a 2-byte 'B', 2-byte CR, 2-byte LF.

I tried reading the text file using the following code:

   std::wifstream file("test.txt");
   file.seekg(2); // skip BOM
   std::wstring A_line;
   std::wstring B_line;
   getline(file,A_line);  // I get "A"
   getline(file,B_line);  // I get "\0B"

I get the same results using >> operator instead of getline

   file >> A_line;
   file >> B_line;

It appears that the CR character is being consumed as only a single byte, or that CR NUL LF is being consumed but not the trailing high (NUL) byte. I would expect wifstream in text mode to read the 2-byte CR and 2-byte LF.

What am I doing wrong? It does not seem right that one should have to read a text file byte by byte in binary mode just to parse the newlines.

+3  A: 

std::wifstream exposes wide (UCS-2) characters to your program, but it still assumes the input file uses narrow characters: by default each byte in the file is widened into one wchar_t. If you want it to actually decode the file as UCS-2, you need to imbue a suitable std::codecvt&lt;wchar_t, char, std::mbstate_t&gt; facet.

You should be able to find your standard library's implementation of std::codecvt&lt;char, char&gt;, which is a non-converting code conversion facet, and adapt it by changing the chars on the internal side to wchar_ts.

Billy ONeal