How to convert an ASCII std::string to a UTF-8 (Unicode) std::string in C++?
A:
I assume that by ASCII you mean CP1252 or some other 8-bit character set (ASCII is only 7 bits and is directly compatible with UTF-8, so no conversion is required). Standard C++ cannot do the conversion by itself. You need e.g. Glibmm, Qt, iconv or the WINAPI to do it.
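For example, with POSIX iconv the conversion might look roughly like this. This is a minimal sketch, not production code: it assumes your libc's iconv accepts "CP1252" as an encoding name (the accepted names vary between platforms), and the function name is made up.

#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Sketch: convert a CP1252-encoded std::string to UTF-8 via POSIX iconv.
std::string CP1252ToUTF8(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "CP1252");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out;
    char* inptr = const_cast<char*>(in.data()); // iconv's interface is not const-correct
    size_t inleft = in.size();
    while (inleft > 0) {
        char buf[256];
        char* outptr = buf;
        size_t outleft = sizeof buf;
        size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
        if (rc == (size_t)-1 && errno != E2BIG) { // E2BIG only means the output buffer is full
            iconv_close(cd);
            throw std::runtime_error("iconv failed");
        }
        out.append(buf, sizeof buf - outleft); // keep whatever was converted this round
    }
    iconv_close(cd);
    return out;
}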
Tronic
2010-02-27 15:16:25
That's a big assumption. CP1252 is very platform specific and there is no indication of platform in the question.
Martin York
2010-02-27 15:20:05
That's why I said "or other". However, it seems that Windows users are the most ignorant about character sets. One big benefit of assuming CP1252 when converting from 8-bit text is that it is also compatible with ISO-8859-1 (but not the other way around).
Tronic
2010-02-27 15:23:06
What a bizarre comment. Code page 1252 is very much a Windows-specific encoding. Saying that Windows users are "most ignorant" about Windows-specific implementation details is, erm, ignorant.
Hans Passant
2010-02-27 15:59:56
It is still probably the most common 8-bit character encoding these days. It is also compatible with ISO-8859-1, which happens to be the most standardized encoding. Even some UNIX programs (e.g. Irssi) default to CP1252 for conversions because of those two reasons. UNIX users are generally less ignorant because they have to deal with UTF-8 and older character encodings all the time (or at least had to, a few years ago). Windows developers, on the other hand, often call all 8-bit encodings ANSI (as if it were only one character set) or even ASCII (as if ASCII were 8-bit).
Tronic
2010-02-27 17:06:49
+2
A:
std::string ASCIIToUTF8(std::string str) {
    return str; // every ASCII byte is already a valid UTF-8 byte, so nothing changes
}
Every ASCII character has the same representation in UTF-8, so there is nothing to convert. Of course, if the input string uses an extended (8-bit) ASCII character set, the answer is more complex.
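For instance, if the 8-bit set happens to be ISO-8859-1 (Latin-1), the conversion can be done by hand, because every Latin-1 byte value is equal to its Unicode code point. A minimal sketch, assuming Latin-1 input only (the function name is made up):

#include <string>

// Sketch for ISO-8859-1 (Latin-1) input: every byte equals its Unicode
// code point, so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string Latin1ToUTF8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);                  // plain ASCII, copy as-is
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // leading byte (0xC2 or 0xC3)
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte: 10xxxxxx
        }
    }
    return out;
}

For any other 8-bit encoding (CP1252 included), the byte values above 0x7F do not match the Unicode code points, so a lookup table or a conversion library is needed instead.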
jalf
2010-02-27 15:17:56
The term "Extended ASCII" has been mostly used only for CP437 (or other MS-DOS codepage), which is nearly extinct these days.
Tronic
2010-02-27 17:13:40
@Tronic: True, but ultimately, any 8-bit character set that is a superset of ASCII is an extended ASCII character set. :)
jalf
2010-02-27 23:52:48
@Eduardo: Which kind of Unicode? An ASCII string is already a perfectly valid UTF-8 Unicode string. Unicode defines several different encodings.
jalf
2010-02-27 23:53:49
+1
A:
ASCII is a seven-bit encoding, and UTF-8 encodes every character in the ASCII range as the identical single byte.
In short, there is nothing to do. Your ASCII string is already valid UTF-8.
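If you want to verify that assumption, a quick check is enough. A minimal sketch (the helper name is made up):

#include <string>

// Sketch: a string is 7-bit ASCII, and therefore already valid UTF-8,
// exactly when every byte is below 0x80.
bool IsASCII(const std::string& s) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (static_cast<unsigned char>(s[i]) > 0x7F)
            return false;
    }
    return true;
}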
Charles Bailey
2010-02-27 15:18:44