views:

842

answers:

7
+2  Q: 

Read Unicode Files

Hello,

I have a problem reading and using the content of Unicode files.

I am working on a Unicode release build, and I am trying to read the content of a Unicode file, but the data shows up as strange characters and I can't find a way to convert it to ASCII.

I'm using fgets; I've tried fgetws, WideCharToMultiByte, and a lot of functions I found in other articles and posts, but nothing worked.

If anyone has been through this and found a solution, please post it. Thank you.

+1  A: 

We'll need more information to answer the question (for example, are you trying to read the Unicode file into a char buffer or a wchar_t buffer? What encoding does the file use?), but for now you might want to make sure you're not running into the following issue if your file is Unicode and you're using fgetws in text mode.

When a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
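
For illustration, here is a minimal sketch that sidesteps that conversion by opening the stream in Unicode mode (assumes a recent MSVC CRT and a UTF-16LE file; the file name is illustrative):

#include <cstdio>
#include <cwchar>

int main() {
    /* ccs=UTF-16LE puts the stream in Unicode mode, so fgetws reads
       the UTF-16 data directly instead of applying the multibyte-to-wide
       conversion described above. */
    FILE* f = _wfopen(L"input.txt", L"rt, ccs=UTF-16LE");
    if (!f) return 1;

    wchar_t line[256];
    while (fgetws(line, 256, f)) {
        wprintf(L"%ls", line);   /* echo the wide-character line */
    }
    fclose(f);
    return 0;
}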

Nick Meyer
+1  A: 

Unicode is the mapping from numerical codes to characters. The step before Unicode is the file's encoding: how do you transform a sequence of bytes into a numerical code? You have to check whether the file is stored as big-endian, little-endian, or something else.

Often, a BOM (byte order mark) is written as the first two bytes of the file: either FF FE (little-endian) or FE FF (big-endian).
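
A small sketch of checking those first two bytes (the file name is illustrative):

#include <cstdio>

int main() {
    FILE* f = fopen("input.txt", "rb");   /* binary mode: raw bytes */
    if (!f) return 1;
    unsigned char bom[2] = {0, 0};
    fread(bom, 1, 2, f);
    if (bom[0] == 0xFF && bom[1] == 0xFE)
        printf("UTF-16 little-endian\n");
    else if (bom[0] == 0xFE && bom[1] == 0xFF)
        printf("UTF-16 big-endian\n");
    else
        printf("no UTF-16 BOM found\n");
    fclose(f);
    return 0;
}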

xtofl
+1  A: 

The intended way of handling charsets is to let the locale system do it.

You have to have set the correct locale before opening your stream.

BTW, you tagged your question C++, but you wrote about fgets and fgetws, not IOStreams; is your problem C++ or C?

For C:

#include <locale.h>
setlocale(LC_ALL, ""); /* at least LC_CTYPE */

For C++

#include <locale>
std::locale::global(std::locale(""));

Then wide IO (wstream, fgetws) should work if your environment is correctly set for Unicode. If not, you'll have to change your environment (I don't know how it works under Windows; for Unix, setting the LC_ALL environment variable is the way, see locale -a for supported values). Alternatively, replacing the empty string with an explicit locale name would also work, but then you hardcode the locale in your program and your users perhaps won't appreciate that.
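
Putting the two together, a minimal C++ sketch (the file name is illustrative; it relies on the environment locale matching the file's encoding, as described above):

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::locale::global(std::locale(""));   /* environment's locale */

    std::wifstream in("input.txt");         /* picks up the global locale */
    std::wstring line;
    while (std::getline(in, line)) {
        std::wcout << line << L'\n';
    }
    return 0;
}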

If your system doesn't support an adequate locale, in C++ you have the possibility to write a facet for the conversion yourself. But that is outside the scope of this answer.

AProgrammer
A: 

First: I assume you are trying to read UTF-8-encoded Unicode (since you can read some characters). You can check this, for example, in Notepad++.

For your problem I'd suggest using some sort of library. You could try Qt; QFile supports Unicode (as does the rest of the library). You'll find it here: http://www.qtsoftware.com/.

If this is too much, use a dedicated Unicode library, for example: http://utfcpp.sourceforge.net/.
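
For instance, a tiny sketch with utfcpp, assuming the file is UTF-8 (names are illustrative):

#include <fstream>
#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"   /* from utfcpp */

int main() {
    std::ifstream in("input.txt", std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    /* convert the UTF-8 bytes to UTF-16 code units */
    std::vector<unsigned short> utf16;
    utf8::utf8to16(bytes.begin(), bytes.end(), std::back_inserter(utf16));
    return 0;
}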

And learn about Unicode: http://en.wikipedia.org/wiki/Unicode. There you'll find references to the different Unicode encodings.

Tobias Langner
+3  A: 

Because you mention WideCharToMultiByte, I will assume you are dealing with Windows.

"read the content of a Unicode file ... find a way to convert it to ASCII"

This might be a problem. If you convert Unicode to ASCII (or another legacy code page) you run the risk of corrupting/losing data. Since you are "working on a Unicode release build", you will want to read Unicode and stay Unicode.

So your final buffer will have to be wchar_t (or WCHAR, or CStringW, same thing).

So your file might be UTF-16 or UTF-8 (UTF-32 is quite rare). For UTF-16, the endianness also matters. If there is a BOM, that will help a lot.

Quick steps (a sketch follows the list):

  • open the file with _wopen or _wfopen in binary mode
  • read the first bytes to identify the encoding using the BOM
  • if the encoding is UTF-8, read into a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8
  • if the encoding is UTF-16BE (big endian), read into a wchar_t array and byte-swap with _swab
  • if the encoding is UTF-16LE (little endian), read into a wchar_t array and you are done
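
A sketch of those steps (Windows-only, error handling trimmed, file name illustrative):

#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

std::wstring read_unicode_file(const wchar_t* path) {
    FILE* f = _wfopen(path, L"rb");   /* binary: no CRT translation */
    if (!f) return L"";
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    if (size <= 0) { fclose(f); return L""; }
    std::vector<unsigned char> buf(size);
    fread(&buf[0], 1, size, f);
    fclose(f);

    if (size >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        /* UTF-8 BOM: convert the rest to wchar_t via MultiByteToWideChar */
        int len = MultiByteToWideChar(CP_UTF8, 0, (const char*)&buf[3],
                                      (int)size - 3, NULL, 0);
        std::wstring out(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, (const char*)&buf[3],
                            (int)size - 3, &out[0], len);
        return out;
    }
    if (size >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        /* UTF-16BE BOM: swap every byte pair to get little-endian */
        _swab((char*)&buf[2], (char*)&buf[2], (int)size - 2);
    }
    if (size >= 2 && ((buf[0] == 0xFF && buf[1] == 0xFE) ||
                      (buf[0] == 0xFE && buf[1] == 0xFF))) {
        /* UTF-16LE (natively, or after the swap above) */
        return std::wstring((const wchar_t*)&buf[2],
                            (size - 2) / sizeof(wchar_t));
    }
    return L"";   /* no BOM: encoding unknown */
}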

Also (if you use a newer Visual Studio), you might take advantage of an MS extension to _wfopen. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rt, ccs=<encoding>"), with the encoding being UTF-8 or UTF-16LE). It can also detect the encoding based on the BOM.

Warning: being cross-platform is problematic; wchar_t can be 2 or 4 bytes, and the conversion routines are not portable...

Mihai Nita
That is exactly the message of this: "If you convert Unicode to ASCII (or another legacy code page) you run the risk of corrupting/losing data"
Mihai Nita
Sorry, my comment here should have been posted as an answer to the question. *Your* answer is correct.
DaveE
A: 

Thanks for your answers - my file is indeed UTF-16. I opened it with UltraEdit and the file starts with FF FE (the BOM). I will go ahead and try some of the advice I got here; hope I get it right this time. Thanks all.

seven
A: 

You CANNOT reliably convert Unicode, even UTF-8, to ASCII. The Unicode character repertoire (organized into 'planes' in the Unicode documentation) does not map back to ASCII - that's why Unicode exists in the first place.
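
A small illustration (Windows-only; the exact output depends on the system code page): converting a character with no ASCII equivalent forces a substitution, and the API reports it.

#include <windows.h>
#include <cstdio>

int main() {
    const wchar_t* w = L"\x4E2D";   /* a CJK character, no ASCII form */
    char narrow[8];
    BOOL lost = FALSE;
    WideCharToMultiByte(CP_ACP, 0, w, -1, narrow, sizeof(narrow),
                        NULL, &lost);
    printf("converted to '%s', data lost: %s\n",
           narrow, lost ? "yes" : "no");   /* typically '?' and "yes" */
    return 0;
}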

DaveE