I need a file I/O library that gives my program a UTF-16 (little-endian) interface but can handle files in other encodings: mainly ASCII (input only), UTF-8, UTF-16 and UTF-32/UCS-4, in both little- and big-endian byte orders.

Having looked around, the only library I found was ICU's ustdio.h.

I did try it, but I couldn't even get it to work with a very simple bit of text, and there is pretty much zero documentation on its usage: only the ICU file reference page, which provides no examples and very little detail (e.g. having made a UFILE from an existing FILE, is it safe to use other functions that take the FILE*? Along with several other questions...).

Also, I'd far rather have a C++ library that gives me a wide-stream interface than a C-style interface...

std::wstring str = L"Hello World in UTF-16!\nAnother line.\n";
UFILE *ufile = u_fopen("out2.txt", "w", 0, "utf-16");
u_file_write(str.c_str(), str.size(), ufile);
u_fclose(ufile);

output

Hello World in UTF-16!਍䄀渀漀琀栀攀爀 氀椀渀攀⸀ഀ

hex

FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00
6F 00 72 00 6C 00 64 00 20 00 69 00 6E 00 20 00
55 00 54 00 46 00 2D 00 31 00 36 00 21 00 0D 0A
00 41 00 6E 00 6F 00 74 00 68 00 65 00 72 00 20
00 6C 00 69 00 6E 00 65 00 2E 00 0D 0A 00

EDIT: The correct output on Windows would be:

FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 
6F 00 72 00 6C 00 64 00 20 00 69 00 6E 00 20 00 
55 00 54 00 46 00 2D 00 31 00 36 00 21 00 0D 00 
0A 00 41 00 6E 00 6F 00 74 00 68 00 65 00 72 00
20 00 6C 00 69 00 6E 00 65 00 2E 00 0D 00 0A 00
+1  A: 

I think the problem comes from the 0D 0A 00 line breaks. You could check whether other line breaks work, such as an explicit \r\n, or LF or CR alone (best bet would be using \r, I suppose).

EDIT: It seems 0D 00 0A 00 is what you want, so you can try:

std::wstring str = L"Hello World in UTF-16!\15\12Another line.\15\12";
schnaader
Tried that sort of stuff: \r works, but \n is replaced by a broken \r\n, so \r\n in my string becomes 0D 00 0D 0A 00.
Fire Lancer
Yes, I thought this would happen with \r\n. I even guess 0D 00 0A 00 would be bad because you would get 2 newlines instead of one.
schnaader
"(best bet would be using \r, I suppose)" I'd rather use a library that is able to write files that are valid on the given platform, i.e. \r\n for DOS/Windows, \n for Linux and \r for Mac. Apart from that, \r alone is likely to break lots of other tools that use these files and expect valid little-endian UTF-16 with Windows line breaks...
Fire Lancer
"0D 00 0A 00" is correct on Windows, so that's exactly what I want it to output (and be able to read) as a new line. \r or \n alone are not correct for Windows files.
Fire Lancer
A: 

UTF8-CPP gives you conversion between UTF-8, 16 and 32. Very nice and light library.

About ICU, some comments by the UTF8-CPP creator:

ICU Library. It is very powerful, complete, feature-rich, mature, and widely used. Also big, intrusive, non-generic, and doesn't play well with the Standard Library. I definitely recommend looking at ICU even if you don't plan to use it.

:)

anno
+1  A: 

You can try the iconv (libiconv) library.
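
As a rough sketch of what using iconv for this could look like, the helper below converts a byte buffer from one encoding to another. The function name convert_encoding is made up for illustration, and the encoding names "UTF-8" and "UTF-16LE" are assumed to be accepted by your iconv implementation (glibc and GNU libiconv accept them; some platforms also need -liconv at link time). Note that iconv only converts encodings; line-break handling is still up to the caller.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
std::string convert_encoding(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the argument order: target first
    if (cd == (iconv_t)(-1)) throw std::runtime_error("iconv_open failed");

    std::string output;
    char buffer[256];
    char* in = const_cast<char*>(input.data());  // iconv does not modify the input
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof buffer;
        const std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == static_cast<std::size_t>(-1) && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed (invalid or incomplete input)");
        }
    }
    iconv_close(cd);
    return output;
}
```

Calling convert_encoding(text, "UTF-8", "UTF-16LE") would give the raw little-endian code units, which you would then write to a file opened in binary mode.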

Vargas
+2  A: 

The problem you see comes from the linefeed conversion. Sadly, it is done at the byte level (after the code conversion) and is not aware of the encoding. In other words, you have to disable the automatic conversion (by opening the file in binary mode, with the "b" flag) and, if you want 0A00 to be expanded to 0D000A00, you'll have to do it yourself.
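
To illustrate doing the expansion yourself, here is a minimal sketch that uses only the standard library rather than ICU. It assumes the text is held as UTF-16 code units in a std::u16string (rather than the std::wstring from the question), and the function names are made up for the example. The newline expansion happens on whole code units, and the file is opened in binary mode so the runtime does not touch the bytes afterwards.

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Expand every LF code unit into CR LF, working on UTF-16 code units, not bytes.
std::u16string expand_newlines(const std::u16string& text) {
    std::u16string out;
    out.reserve(text.size());
    for (char16_t c : text) {
        if (c == u'\n') out.push_back(u'\r');
        out.push_back(c);
    }
    return out;
}

// Serialize code units as little-endian bytes, preceded by the FF FE byte-order mark.
std::vector<unsigned char> to_utf16le_bytes(const std::u16string& text) {
    std::vector<unsigned char> bytes = {0xFF, 0xFE};
    for (char16_t c : text) {
        bytes.push_back(static_cast<unsigned char>(c & 0xFF));  // low byte first
        bytes.push_back(static_cast<unsigned char>(c >> 8));    // then high byte
    }
    return bytes;
}

// Write the file in binary mode so no byte-level linefeed translation is applied.
bool write_utf16le_file(const char* path, const std::u16string& text) {
    const std::vector<unsigned char> bytes = to_utf16le_bytes(expand_newlines(text));
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

With this, write_utf16le_file("out2.txt", u"Hello World in UTF-16!\nAnother line.\n") should produce 0D 00 0A 00 for each line break. A string that already contains \r\n would end up with a doubled \r, so the input is assumed to use \n only.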

You mention that you'd prefer a C++ wide-stream interface, so I'll outline what I did to achieve that in our software:

  • Write a std::codecvt facet that uses an ICU UConverter to perform the conversions.
  • Use a std::wfstream to open the file.
  • imbue() the wfstream with your custom codecvt facet.
  • Open the wfstream with the binary flag, to turn off the automatic (and erroneous) linefeed conversion.
  • Write a "WNewlineFilter" to perform linefeed conversion on wchars, taking inspiration from boost::iostreams::newline_filter.
  • Use a boost::iostreams::filtering_wstream to tie the wfstream and the WNewlineFilter together as a stream.
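
The newline-filter step in the list above can be sketched without Boost as a small std::wstreambuf that sits in front of another wide stream buffer. This is not the actual filter from our software: the class name WNewlineFilterBuf and the helper filter_newlines are invented for illustration, and a hand-written stream buffer stands in for boost::iostreams. In real use the sink would be the rdbuf() of the binary-mode wfstream carrying the custom codecvt facet.

```cpp
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filter: forwards wide characters to another stream buffer,
// expanding each L'\n' into L"\r\n" before any byte-level encoding happens.
class WNewlineFilterBuf : public std::wstreambuf {
public:
    explicit WNewlineFilterBuf(std::wstreambuf* sink) : sink_(sink) {}

protected:
    // No put area is set up, so every character written arrives here.
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return traits_type::not_eof(ch);
        const wchar_t c = traits_type::to_char_type(ch);
        if (c == L'\n' &&
            traits_type::eq_int_type(sink_->sputc(L'\r'), traits_type::eof()))
            return traits_type::eof();
        return sink_->sputc(c);
    }

    int sync() override { return sink_->pubsync(); }

private:
    std::wstreambuf* sink_;  // not owned; must outlive the filter
};

// Demonstration helper: runs text through the filter into an in-memory sink.
std::wstring filter_newlines(const std::wstring& text) {
    std::wostringstream sink;
    WNewlineFilterBuf filter(sink.rdbuf());
    std::wostream out(&filter);
    out << text;
    out.flush();
    return sink.str();
}
```

Because the expansion works on whole wide characters, the codecvt facet further down the chain sees \r and \n as two separate characters and encodes each one correctly.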
Éric Malenfant
+3  A: 

I successfully worked with the EZUTF library posted on CodeProject: High Performance Unicode Text File I/O Routines for C++

vobject