views:

309

answers:

2

Hello everyone,

Sorry if this question is stupid or has been asked a thousand times, but I spent a few hours googling and could not find an answer.

I want to read in a text file that can be any of these: ASCII, UTF-8, or UTF-16 BE/LE. I assume that if the file is Unicode, a BOM is always present.

Is there any automatic way (STL, Boost, or something else) to use a file stream, or anything else, to read the file in line by line without checking BOMs myself, and always get UTF-8 to put into a std::string?

In this project I am using Windows only, but it would also be good to know how to solve this on other platforms.

Thanks in advance!

+2  A: 

libiconv

Ignacio Vazquez-Abrams
Somewhere deep in my heart I was hoping not to use additional libraries. Thanks a lot for the rapid reply!
Andrew
Can you please give me a hint on how this library can be used to read a file? I found only conversion routines, which means I would need to write my own processing of the input and ask the library to do the conversion manually.
Andrew
It can't actually be used to read a file directly; you'll need to use something like `fgets()` to read the text, and then you can put it through a conversion descriptor.
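The "read raw bytes, then push them through a conversion descriptor" approach can be sketched roughly as below. This is a minimal example, not a full reader: `to_utf8` is a helper name of my own, it assumes you have already sniffed the source encoding, and it uses the standard `iconv_open`/`iconv`/`iconv_close` calls (encoding names like `"UTF-16LE"` are the ones GNU libiconv and glibc accept).

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a raw byte buffer (already read from the
// file with fgets(), ifstream::read(), etc.) from `from_enc` to UTF-8.
std::string to_utf8(const std::string& raw, const char* from_enc) {
    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out;
    char buf[256];
    char* in_ptr = const_cast<char*>(raw.data());  // iconv wants char**
    size_t in_left = raw.size();
    while (in_left > 0) {
        char* out_ptr = buf;
        size_t out_left = sizeof buf;
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        if (rc == (size_t)-1 && errno != E2BIG) {  // E2BIG: refill and retry
            iconv_close(cd);
            throw std::runtime_error("conversion to UTF-8 failed");
        }
        out.append(buf, sizeof buf - out_left);
    }
    iconv_close(cd);
    return out;
}
```

For example, `to_utf8(std::string("h\0i\0", 4), "UTF-16LE")` yields the UTF-8 string `"hi"`.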
Ignacio Vazquez-Abrams
+2  A: 

BOMs are often not present in UTF-8 files. As a consequence, you can't know if a file is ASCII or UTF-8 until after you have read the data and found a byte which isn't ASCII.
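The BOM check itself is only a few byte comparisons against the signatures EF BB BF (UTF-8), FF FE (UTF-16LE), and FE FF (UTF-16BE). A minimal sketch (the `sniff_bom` helper and `Encoding` enum are names I made up; UTF-32 BOMs are deliberately not handled, and as noted above the no-BOM case can only be assumed to be UTF-8/ASCII):

```cpp
#include <istream>
#include <sstream>
#include <cassert>

enum class Encoding { Utf8OrAscii, Utf8Bom, Utf16LE, Utf16BE };

// Hypothetical helper: peek at the first bytes of a stream, classify the
// BOM if any, and leave the stream positioned just past it.
Encoding sniff_bom(std::istream& in) {
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    std::streamsize got = in.gcount();
    in.clear();  // a file shorter than 3 bytes sets eofbit/failbit

    if (got >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        in.seekg(3);
        return Encoding::Utf8Bom;
    }
    if (got >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        in.seekg(2);
        return Encoding::Utf16LE;  // note: also matches a UTF-32LE BOM prefix
    }
    if (got >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        in.seekg(2);
        return Encoding::Utf16BE;
    }
    in.seekg(0);                    // no BOM: rewind, assume UTF-8/ASCII
    return Encoding::Utf8OrAscii;
}
```

A stream starting with `"\xEF\xBB\xBF"` is classified as `Encoding::Utf8Bom`, while plain `"hello"` falls through to `Encoding::Utf8OrAscii`.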

Furthermore, as you are on Windows, do you intend to handle ISO-8859-1 and Windows-1252 as well? The latter is often the default for files from programs like Notepad and WordPad. In that case things are even worse: one can only distinguish heuristically between such legacy code pages, other encodings, and UTF-8.

The ICU library has a character set detection system that you can use to guess the likely character encoding of a file. I do not believe that iconv has such a function.

ICU is generally available: it comes preinstalled on Mac OS X and most Linux systems, but, alas, not on Windows. A similar routine may be available in the Win32 API as well.

MtnViewMark
All valid ASCII files are also valid UTF-8 files.
Ignacio Vazquez-Abrams
True enough! If his original purpose is all he needs, then yes, using the BOM to detect the UTF-16 variants, and assuming UTF-8 (or ASCII) in its absence, will work. But in the more general case of dealing with text files, those other encodings will cause this approach to fail.
MtnViewMark
Thanks for the good remark, but I will assume code pages are not my issue here, since handling them would definitely become a problem of its own.
Andrew