I've read a few answers on here about reading Unicode files, and most people point to UTF8-CPP or iconv.

None of the libraries I found handle both ANSI and Unicode files. Ideally I want a single function that I pass a filename to and that returns the contents of the file regardless of its encoding. Is that possible?

If so, any suggestions on how I would go about it?

+2  A: 

Well, that would be a binary read, wouldn't it? All other forms are a matter of interpretation, and then the exact encoding becomes important.

While Unicode can be autodetected in some cases thanks to the BOM, it isn't always there, and a failed detection then means a broken program. I assume that's why most people refrain from it.
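For what it's worth, a minimal sketch of the "just read the raw bytes" approach (the function name is mine; interpreting the bytes is left entirely to the caller):

    // Minimal sketch: read a whole file as raw bytes and leave any
    // interpretation of the encoding to the caller.
    #include <fstream>
    #include <iterator>
    #include <stdexcept>
    #include <string>
    #include <vector>

    std::vector<char> read_file_bytes(const std::string& filename)
    {
        std::ifstream in(filename.c_str(), std::ios::in | std::ios::binary);
        if (!in)
            throw std::runtime_error("cannot open " + filename);

        return std::vector<char>((std::istreambuf_iterator<char>(in)),
                                 std::istreambuf_iterator<char>());
    }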

Marco van de Voort
Maybe he's referring to Unicode file names.
Assaf Lavie
A: 

UTF8-CPP can detect UTF-8 (utf8::is_valid and utf8::find_invalid) and do conversions (utf8::utf8to16, utf8::utf16to8) if you're using wide strings. It works great; what's the problem?
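For illustration, a minimal sketch of that approach, assuming the utfcpp header <utf8.h> and a 16-bit wchar_t (as on Windows); the function name is made up:

    // Sketch: validate a buffer as UTF-8 and convert it to a wide string
    // with UTF8-CPP. Assumes the utfcpp header <utf8.h> and a 16-bit
    // wchar_t (Windows).
    #include <iterator>
    #include <stdexcept>
    #include <string>
    #include <utf8.h>

    std::wstring utf8_bytes_to_wstring(const std::string& bytes)
    {
        // Refuse input that is not well-formed UTF-8 (you could also
        // use utf8::find_invalid to report where it goes wrong).
        if (!utf8::is_valid(bytes.begin(), bytes.end()))
            throw std::runtime_error("input is not valid UTF-8");

        std::wstring wide;
        utf8::utf8to16(bytes.begin(), bytes.end(), std::back_inserter(wide));
        return wide;
    }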

anno
A: 

You can use a combination of techniques:

In general, most Unicode files start with a BOM, which means that if the file starts with 0xFFFE or 0xFEFF you may assume it's UTF-16 (a UTF-8 BOM is 0xEFBBBF). Not many people use UTF-32 AFAIK, but you can still use the BOM method to guess (refer to the Wikipedia page on byte order marks).
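As a rough sketch of that BOM check (the enum and function names are mine, and a missing BOM of course doesn't prove the file is ANSI):

    // Sketch: guess an encoding from the BOM, if there is one. The
    // UTF-32 checks must come before the UTF-16 ones because the
    // UTF-32 LE BOM starts with the same two bytes as UTF-16 LE.
    #include <string>

    enum Encoding { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE };

    Encoding guess_encoding(const std::string& b)
    {
        if (b.size() >= 4 && b[0] == '\xFF' && b[1] == '\xFE' && b[2] == '\x00' && b[3] == '\x00')
            return ENC_UTF32LE;
        if (b.size() >= 4 && b[0] == '\x00' && b[1] == '\x00' && b[2] == '\xFE' && b[3] == '\xFF')
            return ENC_UTF32BE;
        if (b.size() >= 3 && b[0] == '\xEF' && b[1] == '\xBB' && b[2] == '\xBF')
            return ENC_UTF8;
        if (b.size() >= 2 && b[0] == '\xFF' && b[1] == '\xFE')
            return ENC_UTF16LE;
        if (b.size() >= 2 && b[0] == '\xFE' && b[1] == '\xFF')
            return ENC_UTF16BE;
        return ENC_UNKNOWN; // no BOM: could be BOM-less UTF-8 or ANSI
    }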

If it's a UTF-8 file, you can use UTF8-CPP to convert it to UTF-16 (wstring). If it's a UTF-16 file, it can be difficult to read with the standard library. For that, you can refer to my blog post:

http://cfc.kizzx2.com/index.php/reading-a-unicode-utf16-file-in-windows-c/
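As a bare-bones sketch of one way to do it without the locale machinery (not necessarily what the blog post describes), assuming a UTF-16 LE file and a 16-bit wchar_t (Windows), simply pairing up the bytes by hand:

    // Sketch: read a UTF-16 LE file into a std::wstring by pairing up
    // the raw bytes. Assumes a 16-bit wchar_t (Windows), skips the BOM
    // if present, and passes surrogate pairs through untouched.
    #include <cstddef>
    #include <fstream>
    #include <iterator>
    #include <stdexcept>
    #include <string>
    #include <vector>

    std::wstring read_utf16le_file(const std::string& filename)
    {
        std::ifstream in(filename.c_str(), std::ios::in | std::ios::binary);
        if (!in)
            throw std::runtime_error("cannot open " + filename);

        std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                                std::istreambuf_iterator<char>());

        std::size_t i = 0;
        if (bytes.size() >= 2 && bytes[0] == '\xFF' && bytes[1] == '\xFE')
            i = 2; // skip the UTF-16 LE BOM

        std::wstring result;
        for (; i + 1 < bytes.size(); i += 2) {
            unsigned char lo = static_cast<unsigned char>(bytes[i]);
            unsigned char hi = static_cast<unsigned char>(bytes[i + 1]);
            result.push_back(static_cast<wchar_t>(lo | (hi << 8)));
        }
        return result;
    }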

For UTF-32 -- I don't know if anyone uses it, so I have no experience :P

kizzx2