Hi guys, I have to open a file containing Cyrillic characters. I've saved the file as UTF-8. Here is an example:

en: Couldn't your family afford a costume for you
  ru: Не ваша семья позволить себе костюм для вас

Here is how I open the file:

ifstream readFile(fileData.c_str());
while (!readFile.eof())
{
  std::getline(readFile, buffer);
  ...
}

The first problem: there is some symbol before text 'en' (I saw this in the debugger):

"en: least"

The other problem is the Cyrillic characters:

" ru: наименьший"

What's wrong?

A: 

Use libiconv to convert the text to a usable encoding after reading.

Ignacio Vazquez-Abrams
A: 

Use ICU to convert the text.

bmargulies
+3  A: 

there is some symbol before text 'en'

That's a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.

Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) writes it nonetheless. Load the messages file into a text editor and save it back out again as UTF-8, using a “UTF-8 without BOM” encoding if one is specifically listed.
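If re-saving the file is not an option, the three BOM bytes (EF BB BF) can also be dropped in code after reading the first line. This is only a minimal sketch; `stripUtf8Bom` is a made-up helper name, not something from the standard library or from the question's code.

```cpp
#include <string>

// Hypothetical helper: remove a leading UTF-8 encoded U+FEFF
// (the bytes EF BB BF) from `line`, if it is there.
// Returns true when a BOM was found and removed.
bool stripUtf8Bom(std::string& line)
{
    static const char bom[] = "\xEF\xBB\xBF";
    // compare() clamps the length to the string size, so lines
    // shorter than three bytes simply compare unequal.
    if (line.compare(0, 3, bom) == 0) {
        line.erase(0, 3);
        return true;
    }
    return false;
}
```

It would only need to be called once, on the first line returned by `std::getline`, since a BOM can appear only at the very start of the file.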

ru: наименьший

That's what you get when you've got a UTF-8 byte string (representing наименьший) and you print it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read in the string OK and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.
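One way to check that the string was read correctly is to look at its raw bytes rather than at how some viewer renders them. The sketch below assumes nothing beyond the standard library; `hexDump` is a hypothetical helper for illustration only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render each byte of `bytes` as two uppercase
// hex digits, separated by single spaces.
std::string hexDump(const std::string& bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

For the letter н (U+043D) read from a UTF-8 file, the dump is `D0 BD`. Code page 1252 shows byte D0 as Ð and byte BD as ½, which is how each Cyrillic letter turns into two Western European characters on display while the data itself is intact.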

If you're just printing it to the console, this is to be expected, as the console always uses the system default code page (1252 on a Western Windows install), and not UTF-8. If you need to send Unicode to the console you'll have to convert the bytes to native-Unicode wchar_t characters and write them from there. I don't know what the final destination for your strings is though. If you're just going to write them to another file or something, you could just keep them as bytes and not care about what encoding they're in.
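To make the conversion step concrete, here is a rough, portable sketch of decoding UTF-8 bytes into UTF-16 code units, which is the form Windows wide-character APIs work with. It assumes a C++11 compiler, `utf8ToUtf16` is a hypothetical name, and the validation is deliberately minimal (overlong forms and encoded surrogates are not rejected). On Windows itself the same job is normally done by `MultiByteToWideChar` with `CP_UTF8`, and the result can be written with `WriteConsoleW`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Throws std::runtime_error on truncated or malformed sequences.
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;        // code point being assembled
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            if (cp > 0x10FFFF)
                throw std::runtime_error("code point out of range");
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Each Cyrillic letter in the file occupies two UTF-8 bytes and comes out as a single UTF-16 code unit, so a converted line is shorter in units than the original is in bytes.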

bobince
+1  A: 

I suppose your OS is Windows. There are several simple ways:

  1. Use wchar_t, wstring, wifstream, etc.
  2. Use the ICU library
  3. Use some other library (there are really many of them)

Note: to print to the console you must use WinAPI functions to convert UTF-8 to cp866 (my default Cyrillic Windows encoding is cp1251), because the Windows console supports only DOS encodings.

Note: to write to a file, you need to know which encoding the file uses.

den bardadym