views: 98
answers: 5

I use wchar_t for internal strings and UTF-8 for storage in files. I need to use the STL to input/output text to the screen while supporting the full Lithuanian character set.
It all works because I'm not forced to do the same for files, so the following example does the job just fine:

#include <io.h>
#include <fcntl.h>
#include <iostream>

int main () {
    _setmode (_fileno (stdout), _O_U16TEXT);
    std::wcout << L"AaĄąfl" << std::endl;
}
But I became curious and attempted the same with files, with no success. Of course I could use formatted input/output, but that is... discouraged.
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
    _setmode (_fileno (fp), _O_U8TEXT);
    _fwprintf_p (fp, L"AaĄą\nfl");
    fclose (fp);
    _wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
    _setmode (_fileno (fp), _O_U8TEXT);
    wchar_t text[256];
    fseek (fp, 0, SEEK_SET);
    fwscanf (fp, L"%255s", text); // width limit guards against overflowing text[]
    wcout << text << endl;
    fwscanf (fp, L"%255s", text);
    wcout << text << endl;
    fclose (fp);
This snippet works perfectly (although I am not sure how it handles malformed characters). So, is there any way to:

  • get a FILE* or an integer file handle from a std::basic_*fstream?
  • simulate _setmode () on it?
  • extend std::basic_*fstream so it handles UTF-8 I/O?

Yes, I am studying at a university and this is somewhat related to my assignments, but I am trying to figure this out for myself. It won't influence my grade or anything like that.

A: 

You can't make the STL work directly with UTF-8. The basic reason is that the STL indirectly forbids multi-unit characters: each character has to be exactly one char/wchar_t.

Microsoft actually breaks the standard with their UTF-16 encoding, so maybe you can get some inspiration there.

Let_Me_Be
In the first snippet, if I omit _setmode, wcout doesn't work. I thought it would be the same with files too. And after executing the second snippet I tested it: the output file was legit UTF-8.
transistor09
@transistor09 Yes, you can read and output UTF-8, but you can't store it internally in any other way than as raw data or encoded as UTF-32 (UTF-16 on Windows).
Let_Me_Be
@Let_Me_Be That's what I want: keep wchar_t in RAM and store UTF-8 on disk.
transistor09
-1, simply untrue. `char` can have a multibyte encoding, and `std::string` doesn't change that. The Standard Library explicitly confirms it, in fact: there's no reason to have `std::codecvt::max_length` if it would always be 1.
MSalters
@MSalters That specifies the number of external characters matching one internal character! Not the other way around.
Let_Me_Be
@transistor09 Check the UTF-8 facet mentioned in the comment to your question. That is the correct approach.
Let_Me_Be
@Let_Me_Be: The specialiazation `codecvt<char,char,mbstate_t>` is a no-op. In the specialization `codecvt<wchar_t,char,mbstate_t>`, the "internal" character is `wchar_t` and the "external" characters are chars, up to `char[codecvt<wchar_t,char,mbstate_t>::max_length]`
MSalters
@MSalters I have no idea why you are arguing with me. What I'm saying is that you can't have a UTF-8 or UTF-16 std::string (std::basic_string) without breaking the standard. The external encoding can be anything. I'm talking about the internal encoding.
Let_Me_Be
@Let_Me_Be: Sorry, but that is just not true. I cannot prove the absence of such a rule, but if you think there's a rule in the standard you should be able to point to it. Should be in chapter 20 somewhere?
MSalters
@MSalters It is indirectly forbidden. For example, the definition of a null-terminated wide character sequence defines the length as the number of elements before the null character. If you have a variable-length encoding, then this does not hold.
Let_Me_Be
@Let_Me_Be: A "null terminated wide character sequence" is not an STL or a `std::basic_string<charT>` concept. It's a basic C concept, carried over to C++. Besides, UTF-8 would be used for Multi-Byte encodings (`char[]`), not wide characters (`wchar_t[]`).
MSalters
@MSalters `char` has even stricter limitations than `wchar_t`.
Let_Me_Be
@Let_Me_Be : no it doesn't. The standard explicitly acknowledges multi-byte character sequences. The library agrees, there's no point in having mcslen unless it differs from strlen. Furthermore, that still is the C legacy. You _still_ haven't indicated how the STL (`std::basic_string`) would forbid it.
MSalters
@MSalters Well, obviously yes you need `wcslen` in both cases, simply because it takes `const wchar_t*` and not `const char*`.
Let_Me_Be
@Let_Me_Be: Sorry, mix-up there. `mcslen` is not standard. But I found an even clearer example: `MB_CUR_MAX` is the maximum number of chars in a single multi-byte character. Essentially, you state that the standard requires `MB_CUR_MAX == 1`. Why is it then even a #define?!
MSalters
@MSalters `MB_CUR_MAX` should be the number of bytes per `wchar_t`, not the number of `wchar_t` per platform character. At least that is how I understand the description.
Let_Me_Be
A wide character is not a multibyte character, within the language of the standard. [defns.multibyte]: "a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment." [lib.locale.codecvt]: "The class codecvt<internT,externT,stateT> is for use when converting from one codeset to another, such as from wide characters to multibyte characters." Also, the number of bytes per `wchar_t` is `sizeof wchar_t`.
Steve M
@Let_Me_Be: `MB_CUR_MAX` has nothing to do with `wchar_t`.
MSalters
A: 

The easiest way would be to do the conversion to UTF-8 yourself before trying to output. You might get some inspiration from this question: http://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl

Mark Ransom
I know how to convert (I wrote a decoder), but I was wondering if there was a less troublesome way.
transistor09
IMHO a codecvt facet (once implemented) is a convenient way to perform conversions in the STL. Why troublesome?
Basilevs
_Less troublesome_ would be if you could memorize it. Despite that, I agree that facet is convenient in most cases (until standards offer something internal).
transistor09
A: 

get a FILE* or an integer file handle from a std::basic_*fstream?

Answered elsewhere.

Mike DeSimone
I tried this one out but it just didn't work.
transistor09
Um, scratch that, `FILE* fp; wofstream fs (fp);` seems to work just fine!
transistor09
A: 

Use a std::codecvt facet to perform the conversion.

You may use the standard std::codecvt_byname, or a non-standard codecvt implementation.

#include <iostream>
#include <locale>
using namespace std;
typedef codecvt<wchar_t, char, mbstate_t> Cvt;
locale utf8locale (locale (), new codecvt_byname<wchar_t, char, mbstate_t> ("en_US.UTF-8"));
wcout.imbue (utf8locale);
wcout << L"Hello, wide to multibyte world!" << endl;

Beware that on some platforms codecvt_byname can only perform conversions for locales that are installed on the system.

Basilevs
A: 

Well, after some testing I figured out that a FILE* (_iobuf) is accepted by the w*fstream constructor. So, the following code does what I need.

#include <iostream>
#include <fstream>
#include <io.h>
#include <fcntl.h>
// For writing
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
    _setmode (_fileno (fp), _O_U8TEXT);
    wofstream fs (fp);
    fs << L"ąfl";
    fs.flush (); // flush before closing the underlying C stream
    fclose (fp);
// And reading
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
    _setmode (_fileno (fp), _O_U8TEXT);
    wifstream fs (fp);
    wchar_t array[6];
    fs.getline (array, 5);
    wcout << array << endl; // for debugging
    fclose (fp);
This sample reads and writes legit UTF-8 files (without a BOM) on Windows, compiled with Visual Studio 2008.

Can someone give any comments about portability? Improvements?

transistor09