What is the correct way to read Unicode files line by line in C++?
I am trying to read a file saved by Windows Notepad with the "Unicode" encoding, which is UTF-16 little-endian (LE).
Suppose the file contains simply the characters A and B on separate lines.
Reading the file byte by byte, I see the following byte sequence (hex):
FF FE 41 00 0D 00 0A 00 42 00 0D 00 0A 00
So that is a 2-byte BOM, a 2-byte 'A', a 2-byte CR, a 2-byte LF, a 2-byte 'B', a 2-byte CR, and a 2-byte LF.
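For reference, the byte dump above can be reproduced with something like the following. This is only a minimal sketch: hex_dump is a hypothetical helper name, and it assumes the file is small enough to walk through in one pass.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not from any library): returns the raw bytes of a
// file as space-separated, two-digit uppercase hex values.
std::string hex_dump(const std::string& path) {
    // Binary mode so the runtime performs no newline translation.
    std::ifstream in(path, std::ios::binary);
    std::string out;
    char buf[4];
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it) {
        // Go through unsigned char first so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(*it);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_dump("test.txt") and printing the returned string shows every byte exactly as stored on disk, including the BOM and the high zero byte of each character.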
I tried reading the text file using the following code:
std::wifstream file("test.txt");
file.seekg(2); // skip BOM
std::wstring A_line;
std::wstring B_line;
getline(file,A_line); // I get "A"
getline(file,B_line); // I get "\0B"
I get the same results using the >> operator instead of getline:
file >> A_line;
file >> B_line;
It appears that either the CR is being consumed as a single byte only, or the sequence CR NUL LF is being consumed but not the NUL high byte that follows the LF. I would expect a wifstream in text mode to read the 2-byte CR and the 2-byte LF as whole characters.
What am I doing wrong? It does not seem right that one should have to read a text file byte by byte in binary mode just to parse the newlines.
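To make the workaround I am hoping to avoid concrete, here is a rough sketch of the binary-mode approach. It is an illustration under assumptions, not a recommended solution: read_utf16le_lines is a hypothetical helper name, it assumes the file really is UTF-16 LE, and it treats each 16-bit code unit independently, which is enough for finding CR and LF.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (not from any library): decodes a UTF-16 LE file by
// hand and splits it into lines, accepting both LF and CR LF line endings.
std::vector<std::u16string> read_utf16le_lines(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    std::size_t pos = 0;
    // Skip the UTF-16 LE byte order mark (FF FE) if present.
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE) {
        pos = 2;
    }

    std::vector<std::u16string> lines;
    std::u16string current;
    for (; pos + 1 < bytes.size(); pos += 2) {
        // Little endian: low byte first, then high byte.
        const char16_t unit = static_cast<char16_t>(
            static_cast<unsigned char>(bytes[pos]) |
            (static_cast<unsigned char>(bytes[pos + 1]) << 8));
        if (unit == u'\n') {
            // Drop the CR of a CR LF pair before storing the line.
            if (!current.empty() && current.back() == u'\r') current.pop_back();
            lines.push_back(current);
            current.clear();
        } else {
            current.push_back(unit);
        }
    }
    // Keep a final line that has no trailing newline.
    if (!current.empty()) lines.push_back(current);
    return lines;
}
```

With the two-line file described above, this returns the lines "A" and "B" without any stray NUL characters, but it means handling the encoding manually instead of letting the stream do it.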