views: 1488
answers: 4

Hello!

I'm currently working on an MFC program that specifically has to work with UTF-8. At some point, I have to write UTF-8 data into a file; to do that, I'm using CFile and CString.

When I write the UTF-8 data (Russian characters, to be more precise) into the file, the output looks like

Ðàñïå÷àòàíî:
Ñèñòåìà
Ïðîèçâîäñòâî

and so on. This is assuredly not UTF-8. To read this data properly, I have to change my system settings: switching the non-ASCII code page to a Russian encoding table does work, but then all my Latin-based non-ASCII characters break. Anyway, here is how I do it:

CFile CSVFile( m_sCible, CFile::modeCreate|CFile::modeWrite);
CString sWorkingLine;
//Add stuff into sWorkingline
CSVFile.Write(sWorkingLine,sWorkingLine.GetLength());
//Clean sWorkingline and start over

Am I missing something? Shall I use something else instead? Is there some kind of catch I've missed? I'll be tuned in for your wisdom and experience, fellow programmers.

EDIT: Of course, just after asking the question, I finally found something that might be interesting; it can be found here. Thought I might share it.

EDIT 2:

Okay, so I added the BOM to my file, which now contains Chinese characters, probably because I didn't convert my line to UTF-8. To add the BOM I did...

unsigned char BOM[3] = {0xEF, 0xBB, 0xBF}; // unsigned: 0xEF etc. don't fit in a signed char
CSVFile.Write(BOM, 3);

And after that, I added...

    TCHAR TestLine;
    //Convert the line to UTF-8 multibyte.
    WideCharToMultiByte (CP_UTF8,0,sWorkingLine,sWorkingLine.GetLength(),TestLine,strlen(TestLine)+1,NULL,NULL);
    //Add the line to file.
    CSVFile.Write(TestLine,strlen(TestLine)+1);

But then it doesn't compile, as I don't really know how to get the length of TestLine; strlen doesn't seem to accept a TCHAR. Fixed: used a static length of 1000 instead.

EDIT 3:

So, I added this code...

    wchar_t NewLine[1000];
    wcscpy( NewLine, CT2CW( (LPCTSTR) sWorkingLine ));
    TCHAR* TCHARBuf = new TCHAR[1000];

    //Convert the line to UTF-8 multibyte.
    WideCharToMultiByte (CP_UTF8,0,NewLine,1000,TCHARBuf,1000,NULL,NULL);

    //Find how many characters we have to add
    size_t size = 0;
    HRESULT hr = StringCchLength(TCHARBuf, MAX_PATH, &size);

    //Add the line to the file
    CSVFile.Write(TCHARBuf,size);

It compiles fine, but when I look at my new file, it's exactly the same as when I didn't have all this new code (e.g. Ðàñïå÷àòàíî:). It feels like I haven't taken a step forward, although I guess only a small thing separates me from victory.

EDIT 4:

I removed the previously added code, as Nate asked, and decided to use his code instead, meaning that now, when I add my line, I have...

    CT2CA outputString(sWorkingLine, CP_UTF8);

    //Add line to file.
    CSVFile.Write(outputString, ::strlen(outputString));

Everything compiles fine, but the Russian characters are shown as ???????. Getting closer, but still not there. By the way, I'd like to thank everyone who tried/tries to help me; it is MUCH appreciated. I've been stuck on this for a while now, and I can't wait for this problem to be gone.

FINAL EDIT (I hope): By changing the way I first got my UTF-8 characters (I had been re-encoding them without really knowing it), which clashed with my new way of outputting the text, I got acceptable results. By adding the UTF-8 BOM at the beginning of my file, it can be read as Unicode by other programs, like Excel.

Hurray! Thank you everyone!

+4  A: 

You'll have to convert sWorkingLine to UTF-8 and then write it to the file.

WideCharToMultiByte can convert Unicode (UTF-16) strings to UTF-8 if you select the CP_UTF8 code page. MultiByteToWideChar can convert narrow (8-bit) strings to UTF-16.

Nick D
By using such a function, will all the included text be changed to more than one byte, or just the non-ASCII chars?
SeargX
@SeargX, only the non-ASCII ones if you use UTF-8.
Nick D
@Nick D: Perfect, thanks. @Everyone: Which type of string should I put my converted data in? TCHAR? And how do I determine the length of the line, which the MultiByteToWideChar function needs?
SeargX
+1  A: 

Make sure you're using Unicode (TCHAR is wchar_t). Then before you write the data, convert it using the WideCharToMultiByte Win32 API function.

A: 

Hi. :) I can't specifically help you with this, as I'm also facing problems using Unicode/UTF-8 in C++, but scripting languages like Python and Perl really simplify this kind of job. I'm discovering that things are simpler through Python, for example; check out www.python.org for details and tutorials.

mgj
Okay, thanks for the info ;) Although I can't change languages, as I'm not working on my own behalf, and my whole program, a few bugs apart, is already written. =P
SeargX
+1  A: 

When you output data you need to do (this assumes you are compiling in Unicode mode, which is highly recommended):

CString russianText = L"Привет мир";

CFile yourFile(_T("yourfile.txt"), CFile::modeWrite | CFile::modeCreate);

CT2CA outputString(russianText, CP_UTF8);
yourFile.Write(outputString, ::strlen(outputString));

If _UNICODE is not defined (you are working in multi-byte mode instead), you need to know what code page your input text is in and convert it to something you can use. This example shows working with Russian text that is in UTF-16 format, saving it to UTF-8:

// Example 1: convert from Russian text in UTF-16 (note the "L"
// in front of the string), into UTF-8.
CW2A russianTextAsUtf8(L"Привет мир", CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

More likely, your Russian text is in some other code page, such as KOI-8R. In that case, you need to convert from the other code page into UTF-16. Then convert the UTF-16 into UTF-8. You cannot convert directly from KOI-8R to UTF-8 using the conversion macros because they always try to convert narrow text to the system code page. So the easy way is to do this:

// Example 2: convert from Russian text in KOI-8R (code page 20866)
// to UTF-16, and then to UTF-8. Conversions between UTFs are
// lossless.
CA2W russianTextAsUtf16("\xf0\xd2\xc9\xd7\xc5\xd4 \xcd\xc9\xd2", 20866);
CW2A russianTextAsUtf8(russianTextAsUtf16, CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

You don't need a BOM (it's optional; I wouldn't use it unless there was a specific reason to do so).

Make sure you read this: http://msdn.microsoft.com/en-us/library/87zae4a3(VS.80).aspx. If you incorrectly use CT2CA (for example, using the assignment operator) you will run into trouble. The linked documentation page shows examples of how to use and how not to use it.

Further information:

  • The C in CT2CA indicates const. I use it when possible, but some conversions only support the non-const version (e.g. CW2A).
  • The T in CT2CA indicates that you are converting from an LPCTSTR. Thus it will work whether your code is compiled with the _UNICODE flag or not. You could also use CW2A (where W indicates wide characters).
  • The A in CT2CA indicates that you are converting to an "ANSI" (8-bit char) string.
  • Finally, the second parameter to CT2CA indicates the code page you are converting to.

To do the reverse conversion (from UTF-8 to LPCTSTR), you could do:

CString myString(CA2CT(russianText, CP_UTF8));

In this case, we are converting from an "ANSI" string in UTF-8 format, to an LPCTSTR. The LPCTSTR is always assumed to be UTF-16 (if _UNICODE is defined) or the current system code page (if _UNICODE is not defined).

Nate
I tried what you said: I removed the BOM and changed my code to yours. Now the characters are represented as ??????? ??. Something is still missing, maybe? I'll post an edit.
SeargX
Represented as question marks where? Look at the resulting file using a hex editor. You should see something like [this](http://i.imgur.com/RcUsh.png). And if you open it in Notepad, you should see [this](http://imgur.com/Yl3OU.png). If not, then your original text is probably not encoded correctly. Hopefully you are using the `_UNICODE` define and your input is UTF-16. If not, you need to use the macros to convert from whatever code page the original text is in, to your desired code page.
Nate
The question marks are in the resulting file, and all have the question-mark hex code (3F, I think). I am not using the _UNICODE define, and I don't think it would be a good idea: the Russian characters I read come from an XML file, which I open with tinyXML, and it doesn't support UTF-16, only UTF-8 and Latin-1 code pages. I guess I have to use the macros, although I'm not familiar with them.
SeargX
I was able to duplicate your problem by turning off Unicode mode. I updated the answer to show what to do. I would highly recommend using Unicode mode. With the macros you can easily convert text to or from UTF-8 for tinyXML, even with `_UNICODE` defined.
Nate
One more thing. If the data from tinyXML is already UTF-8 then no conversion is needed, right? You really need to know exactly what encoding each string is in at all times.
Nate
Actually, I've realised I was using one of my company's libraries, which encoded the text using a Russian encoding table. I used tinyXML's functions instead, and now I get exactly what you posted in your images: weird characters in hex mode and Russian characters in Notepad. Thanks!
SeargX
Also, by adding the BOM, I'm able to open my file with Excel (I'm creating a CSV file with my data) and the Russian characters appear normally. Actually, I HAVE to double-thank you. *gives a hug to Nate*
SeargX
Awww... you're welcome.
Nate