views: 981
answers: 3

Hey,

I've been trying to figure this out, but nothing seems to work. We have an application that reads thousands of transaction files with the usual fopen/fgets calls, parses them with plain C string functions (strstr, strchr, etc.), and returns a normalized char * to the caller.

However, we now need to read some files that are in Unicode (coming from Windows), and I am having a lot of trouble. In the code I am working on, I only receive a FILE * (file pointer) without knowing whether it points to a plain ASCII file or a Unicode one, and I need to return the data to the application as a char *.

I also cannot run command-line tools to convert the whole file up front, because we are tailing it for new entries.

I tried using WideCharToMultiByte and mbsrtowcs, but after I read the file with fgets and pass the buffer to them, the result is always empty (0 bytes). Does anyone have an example of how to do this properly? The online docs/manuals for these functions all lack good examples.

Thanks!

+4  A: 

I don't have a full answer, but part of the problem is determining the character encoding. Unicode files created on Windows will normally start with a byte-order mark (BOM) -- the Unicode character U+FEFF. If a BOM is found, its byte pattern can be used to determine the encoding.
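
For illustration, a minimal sketch of sniffing the BOM from an already-open FILE * and then rewinding -- the name detect_bom and the returned strings are made up for this example, and UTF-32 BOMs are ignored for brevity:

    #include <stdio.h>

    /* Peek at the first bytes, report a likely encoding based on the BOM,
       then rewind so the caller can re-read from the start of the file. */
    const char *detect_bom(FILE *fp)
    {
        unsigned char b[3] = {0};
        size_t n = fread(b, 1, sizeof b, fp);
        const char *enc = "unknown (no BOM)";

        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            enc = "UTF-8";
        else if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            enc = "UTF-16LE";        /* the usual Windows "Unicode" */
        else if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            enc = "UTF-16BE";

        fseek(fp, 0, SEEK_SET);      /* rewind for normal processing */
        return enc;
    }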

If you have a string encoded in, say, UTF-16, it will contain embedded NUL (zero) bytes, so you cannot use the normal ASCII string functions (strlen and so on): they treat the first zero byte as the end-of-string marker. Your standard library has wide-character versions of these functions that you should use instead.
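
A tiny sketch of the difference, assuming a little-endian machine (so the high byte of each ASCII letter is zero); wcslen() and wcsstr() from <wchar.h> are the wide counterparts of strlen() and strstr():

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        wchar_t text[] = L"transaction";

        /* strlen() stops at the first zero byte, so it sees only one
           character; wcslen() counts wide characters. */
        printf("wcslen: %zu\n", wcslen(text));         /* 11 */
        printf("strlen: %zu\n", strlen((char *)text)); /* 1 on little-endian */

        if (wcsstr(text, L"action"))
            puts("wcsstr found the substring");
        return 0;
    }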

1800 INFORMATION
Thanks. So I can check for that to know whether the file is Unicode or not...
It is interesting that if I run "file" on this Unicode file on Linux, I get: "MPEG 1.0 Layer I, 96 kbit/s, 44100 Hz stereo"
It's a good thing we have this byte order mark! =)
toto
It's great! It converts all your boring text files to mpegs for your listening pleasure
1800 INFORMATION
+3  A: 

That's one of the problems with character encodings: either you assume the data is in some particular encoding, you get that information from inside the data or from external metadata, or you try to detect it.

On Windows, it's common to put a byte-order mark (BOM) at the beginning of the file, but this breaks many tools that expect plain text, so it's not common in the Unix world.

There are libraries devoted to exactly this -- Unicode and character-encoding conversion. The most popular are iconv and ICU.
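
As an illustration, a rough iconv (POSIX) sketch that converts a chunk of UTF-16LE bytes (e.g. one block read with fread) into UTF-8 so existing char * parsing code can run on it -- the function name is made up, the encoding names are assumed to be supported by your platform's iconv, and error handling is kept to a minimum:

    #include <iconv.h>
    #include <stddef.h>

    /* Returns the number of UTF-8 bytes written to out, or 0 on failure. */
    size_t utf16le_to_utf8(const char *in, size_t in_len,
                           char *out, size_t out_cap)
    {
        iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
        if (cd == (iconv_t)-1)
            return 0;                    /* conversion not supported */

        char *inp = (char *)in, *outp = out;
        size_t inleft = in_len, outleft = out_cap;

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            iconv_close(cd);
            return 0;                    /* bad or incomplete sequence */
        }
        iconv_close(cd);
        return out_cap - outleft;        /* UTF-8 bytes produced */
    }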

HMage
A: 

A few points:

If you can be sure that the UNICODE files have a byte-order mark (BOM), you can look for that. However, UNICODE files are not required to have a BOM, so it depends on where they come from.

If the file is UNICODE, you cannot read it with fgets(); you need to use fgetws() or fread(). UNICODE characters may contain zero bytes (bytes with a value of zero), which will confuse fgets().
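
On Windows, the Microsoft CRT lets you request the translation when the file is opened, after which fgetws() returns wide characters. A Windows-specific sketch (the file name is made up, and this only helps if you control how the file is opened):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* ccs= is a Microsoft CRT extension; not portable to Unix. */
        FILE *fp = fopen("transactions.txt", "rt, ccs=UTF-16LE");
        wchar_t line[1024];

        if (fp) {
            while (fgetws(line, 1024, fp))
                fputws(line, stdout);   /* parse with wcsstr()/wcschr() etc. */
            fclose(fp);
        }
        return 0;
    }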

The zero bytes can be your friend. If you read a chunk of the file using fread() and discover embedded zero bytes, it is likely that you have UNICODE. However, the reverse is not true -- the absence of zero bytes does not prove that you have ASCII. English letters in UNICODE will contain zero bytes, but many other languages (e.g. Chinese) will not.
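
A sketch of that heuristic -- read a sample with fread(), look for zero bytes, and rewind (the name looks_like_utf16 and the 512-byte sample size are arbitrary choices):

    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if the sample contains zero bytes (probably UTF-16),
       0 if it does not (inconclusive, as noted above). */
    int looks_like_utf16(FILE *fp)
    {
        unsigned char sample[512];
        size_t n = fread(sample, 1, sizeof sample, fp);
        int has_zero = (n > 0) && memchr(sample, 0, n) != NULL;

        fseek(fp, 0, SEEK_SET);   /* rewind so normal processing can continue */
        return has_zero;
    }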

If you know what language the text is in, you can test for characters that are not valid in that language -- but it is a bit hit and miss.

In the above, I am using "UNICODE" in the Windows way -- to refer to UTF-16 with Intel (little-endian) byte ordering. However, in the real world you could get UTF-8 or UTF-32, and you might get non-Intel byte ordering. (Theoretically you could get UTF-7, but that is pretty rare.)

If you have control over the input files, you can insist that they have BOMs, which makes it easy.

Failing that, if you know the language of the files, you can try to guess the encoding, but that is less than 100% reliable. Otherwise, you might need to ask the operator (if there is one) to specify the encoding.

Michael J