views:

309

answers:

2

Hello everyone,

Sorry if this question is stupid or has been asked a thousand times, but I spent a few hours googling and could not find an answer.

I want to read in a text file that can be any of these: ASCII, UTF-8, or UTF-16 BE/LE. I assume that if the file is Unicode, a BOM is always present.

Is there any automatic way (STL, Boost, or something else) to use a file stream, or anything else, to read the file in line by line without checking BOMs myself, and always get UTF-8 to put into a std::string?

In this project I am using Windows only, but it would also be good to know how to solve this on other platforms.

Thanks in advance!

+2  A: 

libiconv

Ignacio Vazquez-Abrams
Somewhere deep in my heart I was hoping not to use additional libraries. Thanks a lot for the rapid reply!
Andrew
Can you please give me a hint on how this library can be used to read a file? I found only conversion routines, which means I would need to write my own processing of the input and ask the library to do the conversion manually.
Andrew
It can't actually be used to read a file directly; you'll need to use something like `fgets()` to read the text, and then you can put it through a conversion descriptor.
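The "read raw bytes, then push them through a conversion descriptor" approach can be sketched roughly as below. This is a minimal example, not a full reader: `to_utf8` is a helper name of my own, it assumes you have already sniffed the source encoding, and it uses the standard `iconv_open`/`iconv`/`iconv_close` calls (encoding names like `"UTF-16LE"` are the ones GNU libiconv and glibc accept).

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a raw byte buffer (already read from the
// file with fgets(), ifstream::read(), etc.) from `from_enc` to UTF-8.
std::string to_utf8(const std::string& raw, const char* from_enc) {
    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out;
    char buf[256];
    char* in_ptr = const_cast<char*>(raw.data());  // iconv wants char**
    size_t in_left = raw.size();
    while (in_left > 0) {
        char* out_ptr = buf;
        size_t out_left = sizeof buf;
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        if (rc == (size_t)-1 && errno != E2BIG) {  // E2BIG: refill and retry
            iconv_close(cd);
            throw std::runtime_error("conversion to UTF-8 failed");
        }
        out.append(buf, sizeof buf - out_left);
    }
    iconv_close(cd);
    return out;
}
```

For example, `to_utf8(std::string("h\0i\0", 4), "UTF-16LE")` yields the UTF-8 string `"hi"`.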
Ignacio Vazquez-Abrams
+2  A: 

BOMs are often not present in UTF-8 files. As a consequence, you can't know if a file is ASCII or UTF-8 until after you have read the data and found a byte which isn't ASCII.
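The BOM check itself is only a few byte comparisons against the signatures EF BB BF (UTF-8), FF FE (UTF-16LE), and FE FF (UTF-16BE). A minimal sketch (the `sniff_bom` helper and `Encoding` enum are names I made up; UTF-32 BOMs are deliberately not handled, and as noted above the no-BOM case can only be assumed to be UTF-8/ASCII):

```cpp
#include <istream>
#include <sstream>
#include <cassert>

enum class Encoding { Utf8OrAscii, Utf8Bom, Utf16LE, Utf16BE };

// Hypothetical helper: peek at the first bytes of a stream, classify the
// BOM if any, and leave the stream positioned just past it.
Encoding sniff_bom(std::istream& in) {
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    std::streamsize got = in.gcount();
    in.clear();  // a file shorter than 3 bytes sets eofbit/failbit

    if (got >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        in.seekg(3);
        return Encoding::Utf8Bom;
    }
    if (got >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        in.seekg(2);
        return Encoding::Utf16LE;  // note: also matches a UTF-32LE BOM prefix
    }
    if (got >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        in.seekg(2);
        return Encoding::Utf16BE;
    }
    in.seekg(0);                    // no BOM: rewind, assume UTF-8/ASCII
    return Encoding::Utf8OrAscii;
}
```

A stream starting with `"\xEF\xBB\xBF"` is classified as `Encoding::Utf8Bom`, while plain `"hello"` falls through to `Encoding::Utf8OrAscii`.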

Furthermore, as you are on Windows, do you intend to handle ISO-8859-1 and Windows-1252 as well? The latter is often the default for files from programs like Notepad and WordPad. In that case things are even worse: one can only distinguish heuristically between such legacy code pages, other encodings, and UTF-8.

The ICU library has a character set detection system that you can use to guess the likely character encoding of a file. I do not believe that iconv has such a function.

ICU is generally available: it comes preinstalled on Mac OS X and most Linux systems, but, alas, not on Windows. A similar routine may be available in the Win32 API as well.

MtnViewMark
All valid ASCII files are also valid UTF-8 files.
Ignacio Vazquez-Abrams
True enough! If his original purpose is all he needs, then yes, using the BOM to detect the UTF-16 variants, and assuming UTF-8 (or ASCII) in its absence, will work. But in the more general case of dealing with text files, those other encodings will cause this approach to fail.
MtnViewMark
Thanks for the good remark, but I will assume code pages are not my issue here, since handling them would definitely become a problem of its own.
Andrew