views: 205
answers: 8

I have a wstring declared as such:

// random wstring
std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

The literal would be UTF-8 encoded, because my source file is.

[EDIT: According to Mark Ransom this is not necessarily the case, the compiler will decide what encoding to use - let us instead assume that I read this string from a file encoded in e.g. UTF-8]

I would very much like to get this into a file so that it reads (when the text editor is set to the correct encoding)

abcàdëefŸg€hhhhhhhµa

but ofstream is not very cooperative (refuses to take wstring parameters), and wofstream supposedly needs to know locale and encoding settings. I just want to output this set of bytes. How does one normally do this?

EDIT: It must be cross-platform, and should not rely on the encoding being UTF-8. I just happen to have a set of bytes stored in a wstring and want to output them. It could very well be UTF-16, or plain ASCII.

+7  A: 

std::wstring is for something like UTF-16 or UTF-32, not UTF-8. For UTF-8, you probably just want to use std::string, and write out via std::cout. Just FWIW, C++0x will have Unicode literals, which should help clarify situations like this.
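
For reference, the C++0x Unicode literals mentioned look like this (just a sketch; it assumes a compiler with C++0x support, where char16_t and char32_t are new built-in types):

const char*     u8s = u8"abc"; // UTF-8 encoded
const char16_t* u16 = u"abc";  // UTF-16 encoded
const char32_t* u32 = U"abc";  // UTF-32 encoded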

Jerry Coffin
Unfortunately I very much need wstring for UTF-8. UTF-8 code points can take up several bytes, and I need to be able to manipulate the string.
oystein
In practice it's worth noting that newer versions of MinGW g++ (for Windows) support UTF-8 with a BOM, so that g++ can compile UTF-8 encoded source code that can also be compiled with Visual C++.
Alf P. Steinbach
@alf: Are you saying that storing a UTF-8 string in a std::wstring will mess up the encoding? That is not my experience...
oystein
@oystein: wstring simply isn't UTF-8. You can store UTF-8 in a std::string, but you must be very careful using string methods such as find.
Roger Pate
@roger: What do you mean it isn't UTF-8? As far as I know it's just a string class implemented with wchar_t; the encoding should not matter
oystein
@oystein: wchar_t can't (reasonably) represent UTF-8 — its entire *raison d'être* is to represent wide characters instead of a multibyte encoding.
Roger Pate
@alf: I don't think I understand what you are talking about, UTF-8 source files compile fine for me...?
oystein
@roger: I don't see why that would be a problem?
oystein
@oystein: regarding compilation, older versions of g++ choked on a BOM (Byte Order Mark) at the start of a UTF-8 source code file. The snag was/is that Visual C++ required the BOM. Cheers,
Alf P. Steinbach
@alf: Ah, ok. I'm using g++ 4.5.1, which seems to handle it without problems
oystein
wstring is a string represented by Unicode code points, which have a constant (Microsoft thinks in its own way) length of 4 (but usually implemented as 4 or 2) bytes. UTF-8 is a multibyte representation which encodes Unicode code points as sequences of 1-6 bytes.
Basilevs
No, wstring is just a basic_string<wchar_t>. Nothing more.
oystein
@oystein: yes, but the whole point of UTF-8 is to encode a code point into 8-bit "chunks". `wchar_t` is specifically intended for dealing with "chunks" that are larger than 8 bits. As such, while you *can* store UTF-8 into a `wchar_t`, it's utterly pointless to do so. `char` is guaranteed to be (at least) 8 bits, which (in turn) guarantees that it will hold UTF-8 data without a problem.
Jerry Coffin
@jerry: The problem is that many common UTF-8 characters use two (or more) "chunks", and as such create a major headache when assuming that each element (char) in a std::string is a character, which it won't be in that case. Using a wstring, there is more space in each element, and the probability of an element being a whole character increases.
oystein
@oystein: storing utf-8 in a `wstring` will be exactly identical to storing it in a `string`, except you'll always be wasting 1 or 3 bytes for every element. The `wchar_t`'s do not magically absorb multi-byte sequences.
Inverse
@Inverse: The number of bytes wasted would depend on the platform, but yes. The advantage of using wstring is that I can more safely assume that each element contains one character, not e.g. half of one.
oystein
@oystein: That's true *only* if you/your editor actually encodes that character into UTF-16 or UTF-32/UCS-4. Codepoint X converted to UTF-8 will always use the same number of bytes, and they'll always be 8 bits apiece -- storing them into something larger will just waste space. For `wchar_t` to do any good, you need to use UTF-16 or UTF-32/UCS-4 (depending on what size of `wchar_t` your compiler supports -- MS => 16 bits, gcc => 32 bits).
Jerry Coffin
@oystein you can't safely assume that. Microsoft uses UTF-16 for their wide strings. That means only two bytes per unit and up to four per character.
Basilevs
@basilevs: That's why I said "more safely" - compared to std::string
oystein
@jerry: Not sure if I'm getting what you're trying to say here, but according to http://en.wikipedia.org/wiki/UTF-8, UTF-8 is a variable length encoding, which in UTF-8's case means that a character could be 1 byte (8 bits) _or more_.
oystein
@jerry implies that wchar_t is not supposed to store multibyte encodings. His claim is true but irrelevant, as your code doesn't try to do so. You are working with wide strings only, not multibyte ones.
Basilevs
@basilevs: I do not get what you are saying, my strings certainly contain multiple bytes :) And UTF-8 is a variable length encoding, which implies that it could be multibyte.
oystein
By multibyte I mean code points of variable length. Yours are (more or less) of constant length.
Basilevs
http://en.wikipedia.org/wiki/Variable-width_encoding
Basilevs
And you are not using UTF-8 at runtime in your example
Basilevs
@basilevs: No, UTF-8 codepoints are of variable length - am I misunderstanding you completely here? And could you please clarify what you mean by "using UTF-8 at runtime"?
oystein
You are using wide characters at runtime. That is UTF-16 on Windows, UCS-4 (might be wrong) on Linux. No UTF-8 here. UTF-8 code points are of variable length, but you are not using it at runtime.
Basilevs
Why so much misunderstanding? It seemed clear to me that the question was about working with wchar_t strings in the program, then automatically converting to UTF-8 on output. I remember a similar question from a couple of days ago with a good answer but I can't find it now.
Mark Ransom
@Basilevs: What do you mean when you say that wide characters are UTF-16 on Windows? In VC++ a wchar_t is simply a short integer, IIRC. The fact that Windows uses UTF-16 internally does not affect the encoding of the string I store in my variable
oystein
@Mark: Lots of misunderstanding here for sure. I don't really see why a conversion is needed, the string is already encoded as UTF-8, I'm just storing it in a wstring. I'm probably doing something fundamentally wrong.
oystein
@oystein, that's part of the misunderstanding - even if your .cpp is in UTF-8, the string is not. It's Unicode all right, but it's in whatever format your compiler generates for wchar_t which most certainly *won't* be UTF-8.
Mark Ransom
@Mark: Ah, now we are starting to make sense. Are you sure about this? Got any references? I was told that the encoding would be determined by the document encoding. Anyway, that does not really change anything.
oystein
I was told that the document encoding is left as-is when there is no L prefix.
Basilevs
+1  A: 

There is a (Windows-specific) solution that should work for you here. Basically, convert the wstring to the UTF-8 code page and then use ofstream.

#include <windows.h>
#include <fstream>
#include <string>

// Convert a wide (UTF-16) buffer to a UTF-8 encoded std::string.
std::string to_utf8(const wchar_t* buffer, int len)
{
        // First call with a NULL output buffer computes the required size.
        int nChars = ::WideCharToMultiByte(
                CP_UTF8,
                0,
                buffer,
                len,
                NULL,
                0,
                NULL,
                NULL);
        if (nChars == 0) return "";

        std::string newbuffer;
        newbuffer.resize(nChars);
        // Second call performs the actual conversion into the string's buffer.
        ::WideCharToMultiByte(
                CP_UTF8,
                0,
                buffer,
                len,
                const_cast<char*>(newbuffer.c_str()),
                nChars,
                NULL,
                NULL);

        return newbuffer;
}

std::string to_utf8(const std::wstring& str)
{
        return to_utf8(str.c_str(), (int)str.size());
}

int main()
{
        std::ofstream testFile;

        testFile.open("demo.xml", std::ios::out | std::ios::binary);

        std::wstring text =
                L"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                L"<root description=\"this is a naïve example\">\n</root>";

        std::string outtext = to_utf8(text);

        testFile << outtext;

        testFile.close();

        return 0;
}
Steve Townsend
That's all nice, but I won't know the encoding of my string, and as such this won't really help. Also, I need to be cross-platform
oystein
@luke - I did link to that, in the first line of the first version of the response.
Steve Townsend
@Steve, aaaaahhh, I already had the link in my history, so it looked like plain text. Terribly sorry.
luke
@luke - np at all; @oystein - I will leave this here for future reference anyway - sorry it's not useful in your scenario.
Steve Townsend
A: 

Note that wide streams output only char * variables, so maybe you should try using the c_str() member function to convert a std::wstring and then output it to the file. Then it should probably work?

sukhbir
Did not seem to work for me, not with wofstream and not with ofstream
oystein
Aah oops. Sorry for not being helpful.
sukhbir
+4  A: 

Why not write the file as binary? Just use ofstream with the std::ios::binary flag. The editor should be able to interpret it then. Don't forget the Unicode BOM (0xFEFF) at the beginning; see the sketch below. You might be better off writing with a library; try one of these:

http://www.codeproject.com/KB/files/EZUTF.aspx

http://www.gnu.org/software/libiconv/

http://utfcpp.sourceforge.net/
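
One way this might look (a sketch of my reading of this answer; note that the bytes written, including the BOM's byte order, depend on the platform's sizeof(wchar_t) and endianness):

#include <fstream>
#include <string>

int main()
{
    std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

    std::ofstream out("out.txt", std::ios::out | std::ios::binary);

    // Write the BOM first, then the raw bytes of the wstring buffer.
    wchar_t bom = 0xFEFF;
    out.write(reinterpret_cast<const char*>(&bom), sizeof(bom));
    out.write(reinterpret_cast<const char*>(str.data()),
              str.size() * sizeof(wchar_t));

    return 0;
}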

inf.ig.sh
The problem is that I won't know that this is UTF-8, so I'll have to do without the BOM. But still, I'll see if I can use binary. It's a bit hairy for what I'm doing, though - I'd rather avoid it if possible.
oystein
I have decided to drop unicode support, it's not worth it in my case. Yet, I feel this answer was the closest one to a working solution, so you get the accepted status (at least for now).
oystein
+2  A: 

C++ has the means to convert wide characters to localized (narrow) ones on output or file write. Use a codecvt facet for that purpose.

You may use the standard std::codecvt_byname, or a non-standard codecvt facet implementation.

#include <locale>
#include <iostream>
using namespace std;
typedef codecvt<wchar_t, char, mbstate_t> Cvt; // the standard facet type
locale utf8locale(locale(), new codecvt_byname<wchar_t, char, mbstate_t>("en_US.UTF-8"));
wcout.imbue(utf8locale);
wcout << L"Hello, wide to multibyte world!" << endl;

Beware that on some platforms codecvt_byname can only perform conversions for locales that are installed on the system. I therefore recommend searching Stack Overflow for "utf8 codecvt" and choosing from the many references to custom codecvt implementations listed.
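
The same imbue works for a file stream; a sketch, under the same assumption that an "en_US.UTF-8" locale is installed on the system:

#include <cwchar>
#include <fstream>
#include <locale>

int main()
{
    std::wofstream out;
    // Imbue before any output so the facet converts everything written.
    out.imbue(std::locale(std::locale(),
        new std::codecvt_byname<wchar_t, char, std::mbstate_t>("en_US.UTF-8")));
    out.open("out.txt");
    out << L"Hello, wide to multibyte world!" << std::endl;
    return 0;
}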

EDIT: As the OP states that the string is already encoded, all he should do is remove the L prefix and the "w" from every token of his code.
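
That is, a sketch of this suggestion (the byte values here are hypothetical UTF-8 content for illustration):

#include <fstream>
#include <string>

int main()
{
    // The bytes already carry the external encoding; no conversion needed.
    std::string str = "abc\xC3\xA0" "d\xC3\xAB" "ef"; // "abcàdëef" in UTF-8
    std::ofstream out("out.txt", std::ios::binary);
    out << str; // bytes pass through untouched
    return 0;
}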

Basilevs
Actually codecvt might be used to perform any conversion needed, but the most common use, and the one provided by the STL, is input/output operations.
Basilevs
Yes, but I do not want to convert anything, or am I missing something? The string is already encoded
oystein
Then why are you making the compiler convert it to Unicode with the L prefix? Just output it with narrow streams.
Basilevs
Encoded means stored in an external encoding. In your case you _write_ in the external encoding. Then the compiler converts your code to Unicode, the internal encoding, and stores that in the object module. Therefore if you want to output something, you should perform a backward conversion, or stop making the compiler do the unnecessary work.
Basilevs
@basilevs: The L prefix does not magically make the compiler convert it to Unicode, it just means that the string is a wchar_t literal. A wide string.
oystein
Well you sure know better. Might as well post the output of the test program to make me blush.
Basilevs
@basilevs: I'm not trying to be rude or anything. Storing the string as std::string and outputting it with ofstream obviously works. But that does not solve my problem, which is why I created this question in the first place.
oystein
My point is that a wide literal IS stored as wide code points in a string constant at compile time. Therefore there is no way (except some dirty Microsoft hacks) to output that constant without some kind of conversion (Windows allows UTF-16 output). The conversion may be done by an explicit function call or by imbuing the needed locale into a wide output stream.
Basilevs
God damn that Microsoft! It's making explanations so much harder!
Basilevs
@basilevs: Well, I'll make it easy for you: take that constant and throw it into the nearest trash bin - it was just an example :) The point is that I have a string of unknown encoding (probably UTF-8) stored in a wstring.
oystein
As I mentioned in a comment to another answer, that is almost impossible to do. You can't widen an unknown encoding. Widening is a process that makes a code point take more space to ease the processing of data. If you can't widen the input, you should work with it in its raw form. std::string or vector<char> are appropriate containers for that. Narrow streams should be used with an unknown encoding.
Basilevs
A: 

I had the same problem some time ago and wrote down the solution I found on my blog. You might want to check it out to see if it might help, especially the function wstring_to_utf8.

http://pileborg.org/blog5.php/2010/06/13/unicode-utf-8-and-wchar_t

Joachim Pileborg
Thank you for that, but it's not quite what I'm after, since I do not know what encoding my string will be in. For this example I just picked UTF-8. Also, I don't think wchar_t is guaranteed to be able to contain a 4-byte character (UCS-4)? It is on Linux, but I think Windows users will face some problems here.
oystein
A: 

You should not use a UTF-8 encoded source file if you want to write portable code. Sorry.

  std::wstring str = L"abcàdëefŸg€hhhhhhhµa";

(I am not sure if this actually violates the standard, but I think it does. Even if it doesn't, to be safe you should not.)

Yes, using std::ostream alone will not work. There are many ways to convert a wstring to UTF-8. My favorite is using the International Components for Unicode (ICU). It's a big lib, but it's great. You get a lot of extras and things you might need in the future.
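
For illustration, a minimal ICU sketch (assuming ICU is installed and the program links against icuuc; UnicodeString holds UTF-16 internally and can emit UTF-8 directly):

#include <unicode/unistr.h>
#include <fstream>
#include <string>

int main()
{
    // Build a UnicodeString, then serialize it as UTF-8.
    icu::UnicodeString ustr("this is a naive example");
    std::string utf8;
    ustr.toUTF8String(utf8);

    std::ofstream out("out.txt", std::ios::binary);
    out << utf8;
    return 0;
}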

towi
Sorry, I feel people don't get the point of this question; maybe I'm not clear enough. The problem is not UTF-8, that was just an example I picked. I will probably read the (w)string from a file, and it could have any encoding. The problem is writing it back to a file.
oystein
I see. Then you probably just have to make sure to open the file in binary mode.
towi
@oystein, wow, I get your problem now. If you don't know the encoding you can't transform code points. And if you can't do that, there is no point in wchar_t. The top voted answer is surely right.
Basilevs
@towi: Probably, see inf.ig.sh's answer. I might end up with that. @basilevs: There is a reason I'm using wchar_t. I want to do lots of heavy manipulation on that string before I write it back, and I have to rely on each element of my string being one whole character. That's not going to be the case with std::string as soon as you step outside the English-speaking world. With wide strings, it'll be likely enough that I can live with it.
oystein
A: 

From my experience of working with different character encodings, I would recommend that you only deal with UTF-8 at load and save time. You're in for a world of pain if you try to store the internal representation in UTF-8, since a single character can be anything from 1 to 4 bytes. Simple operations like strlen then require looking at every byte to determine the length, rather than relying on the allocated buffer (although you can optimize by looking at the first byte of each sequence, e.g. 00..7f is a single-byte char, c2..df starts a 2-byte char, etc.).
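
A sketch of that counting trick (hypothetical helper name; assumes well-formed UTF-8 input):

#include <cstddef>
#include <string>

// Count code points by skipping continuation bytes (of the form 10xxxxxx).
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
        if ((static_cast<unsigned char>(*it) & 0xC0) != 0x80)
            ++count;
    return count;
}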

People quite often refer to 'Unicode strings' when they mean UTF-16, and on Windows a wchar_t is a fixed 2 bytes. On Windows I think wchar_t is simply:

typedef unsigned short wchar_t;

The full UTF-32 4-byte representation is rarely required and very wasteful; here's what the Unicode Standard (5.0) has to say about it:

"On average more than 99% of all UTF-16 is expressed using single code units... UTF-16 provides the right mix of compact size with the ability to handle the occassional character outside the BMP"

In short, use wchar_t as your internal representation and do conversions when loading and saving (and don't worry about full Unicode unless you know you need it).

With regard to performing the actual conversion, have a look at the ICU project:

http://site.icu-project.org/

snowdude
Some sensible words here. I was trying to avoid encodings altogether, to be honest, since I really won't know what I'll get thrown at me in this case. That makes doing any conversions difficult. Storing it as a vector<char> (or similar) would mean that I have to make my own string class, and Unicode support is _really_ not worth that much coding time. It's starting to look like I'm going to drop Unicode support for now, but we'll see.
oystein
(1) It's often more useful to know how many *bytes* are in a string (for memory allocation, disk space, etc.), than it is to know how many *characters* are in a string. For this purpose, `strlen` *does* work correctly for UTF-8.
dan04
(2) It's not true that "most OSes consider a wchar_t as fixed 2 bytes" or as UTF-16. That's a Windows thing, done for backwards compatibility with UCS-2-based older versions of NT. On Linux, `wchar_t` is usually UTF-32. So, for cross-platform code, you either need to use UTF-8 or typedef your own UTF-16 / UTF-32 types. Fortunately, the new C++ standard will have `char16_t` and `char32_t`.
dan04
@dan04 To be honest I spend most of my time in the Win world so I can't argue about other OSes. The Unicode Standard (5.0) states "On average more than 99% of all UTF-16 is expressed using single code units... UTF-16 provides the right mix of compact size with the ability to handle the occasional character outside the BMP". That's my main point. With regard to how useful it is to know character counts rather than byte sizes... try writing any character-processing code without knowing character lengths! UTF-8 is great for portability (no byte-ordering issues) but not for working in.
snowdude
I've written a *lot* of string-handling code that doesn't care about character lengths. Consider for example, a routine to convert DOS-style line breaks to Unix-style ones. It doesn't matter if the 3 bytes "\xE2\x82\xAC" represent a single character; you're just going to output them unchanged. All you care about is '\r' and '\n' which are the same in UTF-8 as they are in ASCII.
dan04