views: 98
answers: 5

I use wchar_t for internal strings and UTF-8 for storage in files. I need to use the STL to input/output text to the screen while supporting the full Lithuanian character set.
It all works because I'm not forced to do the same for files, so the following example does the job just fine:

#include <io.h>
#include <fcntl.h>
#include <iostream>

int main () {
    _setmode (_fileno (stdout), _O_U16TEXT);
    std::wcout << L"AaĄąfl" << std::endl;
}
But I became curious and attempted the same with files, with no success. Of course I could use formatted input/output, but that is... discouraged.
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
    _setmode (_fileno (fp), _O_U8TEXT);
    _fwprintf_p (fp, L"AaĄą\nfl");
    fclose (fp);
    _wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
    _setmode (_fileno (fp), _O_U8TEXT);
    wchar_t text[256];
    fseek (fp, 0, SEEK_SET);
    fwscanf (fp, L"%255s", text); // width limit guards against overflowing text[]
    wcout << text << endl;
    fwscanf (fp, L"%255s", text);
    wcout << text << endl;
    fclose (fp);
This snippet works perfectly (although I am not sure how it handles malformed characters). So, is there any way to:

  • get a FILE* or an integer file handle from a std::basic_*fstream?
  • simulate _setmode () on it?
  • extend std::basic_*fstream so it handles UTF-8 I/O?

Yes, I am studying at a university and this is somewhat related to my assignments, but I am trying to figure this out for myself. It won't influence my grade or anything like that.

A: 

You can't make the STL work directly with UTF-8. The basic reason is that the STL indirectly forbids multi-unit characters: each character has to be exactly one char/wchar_t.

Microsoft actually breaks the standard with their UTF-16 encoding, so maybe you can get some inspiration there.

Let_Me_Be
In the first snippet, if I omit _setmode, wcout doesn't work. I thought it would be the same with files too. And after executing the second snippet I tested it: the output file was legit UTF-8.
transistor09
@transistor09 Yes, you can read and output UTF-8, but you can't store it internally in any other way than as raw data or encoded as UTF-32 (UTF-16 on Windows).
Let_Me_Be
@Let_Me_Be That's what I want: keep wchar_t in RAM and store UTF-8 on disk.
transistor09
-1, simply untrue. `char` can have a multibyte encoding, and `std::string` doesn't change that. The Standard Library explicitly confirms it, in fact: there's no reason to have `std::codecvt::max_length` if it would always be 1.
MSalters
@MSalters That specifies the number of external characters matching one internal character! Not the other way around.
Let_Me_Be
@transistor09 Check the UTF-8 facet mentioned in the comment to your question. That is the correct approach.
Let_Me_Be
@Let_Me_Be: The specialiazation `codecvt<char,char,mbstate_t>` is a no-op. In the specialization `codecvt<wchar_t,char,mbstate_t>`, the "internal" character is `wchar_t` and the "external" characters are chars, up to `char[codecvt<wchar_t,char,mbstate_t>::max_length]`
MSalters
@MSalters I have no idea why you are arguing with me. What I'm saying is that you can't have a UTF-8 or UTF-16 std::string (std::basic_string) without breaking the standard. The external encoding can be anything. I'm talking about the internal encoding.
Let_Me_Be
@Let_Me_Be: Sorry, but that is just not true. I cannot prove the absence of such a rule, but if you think there's a rule in the standard you should be able to point to it. Should be in chapter 20 somewhere?
MSalters
@MSalters It is indirectly forbidden. For example, the definition of a null-terminated wide character sequence defines the length as the number of elements before the null character. If you have a variable-length encoding, then this does not hold.
Let_Me_Be
@Let_Me_Be: A "null terminated wide character sequence" is not an STL or a `std::basic_string<charT>` concept. It's a basic C concept, carried over to C++. Besides, UTF-8 would be used for Multi-Byte encodings (`char[]`), not wide characters (`wchar_t[]`).
MSalters
@MSalters `char` has even stricter limitations than `wchar_t`.
Let_Me_Be
@Let_Me_Be : no it doesn't. The standard explicitly acknowledges multi-byte character sequences. The library agrees, there's no point in having mcslen unless it differs from strlen. Furthermore, that still is the C legacy. You _still_ haven't indicated how the STL (`std::basic_string`) would forbid it.
MSalters
@MSalters Well, obviously yes you need `wcslen` in both cases, simply because it takes `const wchar_t*` and not `const char*`.
Let_Me_Be
@Let_Me_Be: Sorry, mix-up there. `mcslen` is not standard. But I found an even clearer example: `MB_CUR_MAX` is the maximum number of chars in a single multi-byte character. Essentially, you state that the standard requires `MB_CUR_MAX == 1`. Why is it then even a #define?!
MSalters
@MSalters `MB_CUR_MAX` should be the number of bytes per `wchar_t`, not the number of `wchar_t` per platform character. At least that is how I understand the description.
Let_Me_Be
A wide character is not a multibyte character, within the language of the standard. [defns.multibyte]: "a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment." [lib.locale.codecvt]: "The class codecvt<internT,externT,stateT> is for use when converting from one codeset to another, such as from wide characters to multibyte characters." Also, the number of bytes per `wchar_t` is `sizeof wchar_t`.
Steve M
@Let_Me_Be: `MB_CUR_MAX` has nothing to do with `wchar_t`.
MSalters
A: 

The easiest way would be to do the conversion to UTF-8 yourself before trying to output. You might get some inspiration from this question: http://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl

Mark Ransom
I know how to convert (I wrote a decoder), but I was wondering if there was a less troublesome way.
transistor09
IMHO a codecvt facet (once implemented) is a convenient way to perform conversions in the STL. Why troublesome?
Basilevs
_Less troublesome_ would be if you could memorize it. Despite that, I agree that facet is convenient in most cases (until standards offer something internal).
transistor09
A: 

get a FILE* or an integer file handle from a std::basic_*fstream?

Answered elsewhere.

Mike DeSimone
I tried this one out but it just didn't work.
transistor09
Um, scratch that, `FILE* fp; wofstream fs (fp);` seems to work just fine!
transistor09
A: 

Use a std::codecvt facet to perform the conversion.

You may use the standard std::codecvt_byname, or a non-standard codecvt implementation.

#include <iostream>
#include <locale>
using namespace std;
typedef codecvt<wchar_t, char, mbstate_t> Cvt;
locale utf8locale (locale (), new codecvt_byname<wchar_t, char, mbstate_t> ("en_US.UTF-8"));
wcout.imbue (utf8locale);
wcout << L"Hello, wide to multibyte world!" << endl;

Beware that on some platforms codecvt_byname can only perform conversions for locales that are installed on the system.

Basilevs
A: 

Well, after some testing I figured out that a FILE* (_iobuf) is accepted by the w*fstream constructor. So, the following code does what I need.

#include <iostream>
#include <fstream>
#include <io.h>
#include <fcntl.h>
// For writing
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
    _setmode (_fileno (fp), _O_U8TEXT);
    wofstream fs (fp);
    fs << L"ąfl";
    fs.flush (); // flush before closing the underlying C stream
    fclose (fp);
// And reading
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
    _setmode (_fileno (fp), _O_U8TEXT);
    wifstream fs (fp);
    wchar_t array[6];
    fs.getline (array, 5);
    wcout << array << endl; // for debugging
    fclose (fp);
This sample reads and writes legit UTF-8 files (without a BOM) on Windows, compiled with Visual Studio 2008.

Can someone give any comments about portability? Improvements?

transistor09