I am trying to convert a C++ std::string to UTF-8 or std::wstring without losing information (consider a string that contains non-ASCII characters).

According to http://forums.sun.com/thread.jspa?threadID=486770&forumID=31:

If the std::string has non-ASCII characters, you must provide a function that converts from your encoding to UTF-8 [...]

What encoding does std::string::c_str() use? How can I convert it to UTF-8 or std::wstring in a cross-platform fashion?

+11  A: 

std::string per se uses no encoding -- it will return the bytes you put in it. For example, those bytes might be using ISO-8859-1 encoding... or any other, really: the information about the encoding is just not there -- you have to know where the bytes came from!

Alex Martelli
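To make the point above concrete, here is a minimal sketch (the helper name `hex_bytes` is made up for illustration) that dumps the raw bytes held by a std::string. It assumes nothing about the encoding, which is exactly the situation std::string itself is in.

```cpp
#include <string>

// Renders each byte of `s` as two uppercase hex digits separated by spaces.
// Hypothetical helper for illustration; it does not interpret the bytes.
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : s) {
        if (!out.empty()) out.push_back(' ');
        out.push_back(digits[byte >> 4]);    // high nibble
        out.push_back(digits[byte & 0x0F]);  // low nibble
    }
    return out;
}
```

The word "café" stored as ISO-8859-1 is the four bytes 63 61 66 E9, while the same word stored as UTF-8 is the five bytes 63 61 66 C3 A9. Both fit in a std::string, size() reports 4 and 5 respectively, and nothing in the object records which encoding was used.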
So essentially there is no way for me to convert std::string without knowing its encoding ahead of time? I ask because I'm writing an API function that takes in a std::string. I guess the documentation will need to instruct users what format to pass in.
Gili
@Gili, right: you cannot reliably convert a byte sequence in an unknown encoding to UTF-8 (or anything else;-). I recommend you ask the caller to supply UTF-8 data -- most other encodings don't allow encoding _every_ possible Unicode string. As @Naaff says, ASCII is a special case of UTF-8 (and ISO-8859-* and many other encodings), so if that's your case there's no worry (a footnote in the docs reminding the users of this fact might save _them_ worry;-).
Alex Martelli
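If the API documents "UTF-8 only", it can at least check that the caller kept the promise. The following is a sketch of such a check, not a library routine: the name `is_valid_utf8` is hypothetical, and it only tests well-formedness (it cannot tell whether well-formed bytes were really meant as UTF-8).

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8: correct lead/continuation
// structure, no overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // continuation bytes expected after the lead
        unsigned long cp = 0;    // decoded code point
        unsigned long min = 0;   // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;       // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;  // sequence truncated at end of input
        for (std::size_t j = 1; j <= extra; ++j) {
            const unsigned char c = static_cast<unsigned char>(s[i + j]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        i += extra + 1;
    }
    return true;
}
```

An API function could call this at its boundary and reject the input (for example by throwing std::invalid_argument) instead of silently passing mis-encoded bytes further down.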
Good answer, thank you :)
Gili
ISO-8859-* are in no way a "special case" of UTF-8. They are simply different single-byte encodings.
n0rd
ASCII strings are also UTF-8 strings and ISO-8859-1 strings ;-).
Alex Martelli
+5  A: 

std::string contains any sequence of bytes, so the encoding is up to you. You must know how it is encoded. However, if you have no reason to believe it is something else, it's probably just ASCII, in which case it's already valid UTF-8.

Naaff
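Rather than assuming the ASCII case, it is cheap to test for it. A small sketch (the function name `is_ascii` is made up here): if every byte is below 0x80, the string is ASCII and therefore also valid UTF-8.

```cpp
#include <algorithm>
#include <string>

// Returns true if every byte in `s` is in the 7-bit ASCII range (0x00..0x7F).
bool is_ascii(const std::string& s) {
    return std::all_of(s.begin(), s.end(), [](char c) {
        // Cast first: plain char may be signed, making bytes >= 0x80 negative.
        return static_cast<unsigned char>(c) < 0x80;
    });
}
```

A false result only says the data is not pure ASCII; it gives no hint about which encoding the bytes actually use.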
I have seen "it's probably just..." be the source of so many character encoding errors. I suggest never guessing when it comes to character encodings: always be very explicit in what you take and what you produce. In each case, if you don't spec the character set, then spec an additional parameter/return value to indicate the encoding.
MtnViewMark
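One way the "additional parameter" suggestion above might look in practice is sketched below. The `Encoding` enum and `to_utf8` function are hypothetical names, and only three encodings are covered; a real API would likely delegate other encodings to a library such as ICU or iconv.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical tag the caller must supply alongside the bytes.
enum class Encoding { Ascii, Latin1, Utf8 };

// Converts `bytes`, interpreted according to `encoding`, to UTF-8.
std::string to_utf8(const std::string& bytes, Encoding encoding) {
    switch (encoding) {
    case Encoding::Ascii:
    case Encoding::Utf8:
        // ASCII is a subset of UTF-8; this sketch passes both through unchecked.
        return bytes;
    case Encoding::Latin1: {
        // Each ISO-8859-1 byte value equals its Unicode code point (U+0000..U+00FF).
        std::string utf8;
        utf8.reserve(bytes.size() * 2);
        for (unsigned char byte : bytes) {
            if (byte < 0x80) {
                utf8.push_back(static_cast<char>(byte));
            } else {
                utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));    // lead byte
                utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));  // continuation
            }
        }
        return utf8;
    }
    }
    throw std::invalid_argument("unsupported encoding");
}
```

With this shape the caller cannot hand over bytes without also stating what they are, so the conversion never has to guess.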