views: 369
answers: 3

How to convert an ASCII std::string to a UTF-8 (Unicode) std::string in C++?

A: 

I assume that by ASCII you mean CP1252 or some other 8-bit character set (ASCII proper is only 7 bits and is directly compatible with UTF-8, so no conversion is required). Standard C++ cannot do the conversion by itself. You need e.g. Glibmm, Qt, iconv or the WINAPI to do it.
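For example, with POSIX iconv the conversion might look roughly like this. This is a minimal sketch assuming the input really is CP1252; the helper name cp1252_to_utf8 and the simplified error handling are mine, and on some platforms (e.g. macOS) you may need to link with -liconv:

#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

std::string cp1252_to_utf8(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "CP1252");  // arguments: to, from
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Every CP1252 byte encodes to at most a few UTF-8 bytes,
    // so one sufficiently large buffer avoids looping on E2BIG.
    std::vector<char> buf(in.size() * 4 + 4);
    char* src = const_cast<char*>(in.data());
    size_t src_left = in.size();
    char* dst = buf.data();
    size_t dst_left = buf.size();

    if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("iconv conversion failed");
    }
    iconv_close(cd);
    return std::string(buf.data(), buf.size() - dst_left);
}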

Tronic
That's a big assumption. CP1252 is very platform-specific, and there is no indication of platform in the question.
Martin York
That's why I said "or other". However, it seems that Windows users are the most ignorant about character sets. One big benefit of assuming CP1252 when converting from an 8-bit encoding is that it also decodes ISO-8859-1 text correctly (but not the other way around).
Tronic
What a bizarre comment. Code page 1252 is very much a Windows-specific encoding. Saying that Windows users are "most ignorant" about Windows-specific implementation details is, erm, ignorant.
Hans Passant
It is still probably the most common 8-bit character encoding these days. It is also compatible with ISO-8859-1, which happens to be the most standardized encoding. Even some UNIX programs (e.g. Irssi) default to CP1252 for conversions for those two reasons. UNIX users are generally less ignorant because they have to deal with UTF-8 and older character encodings all the time (or at least they had to a few years ago). Windows developers, on the other hand, often call all 8-bit encodings ANSI (as if there were only one such character set) or even ASCII (as if ASCII were 8-bit).
Tronic
+2  A: 
std::string ASCIIToUTF8(std::string str) {
  // ASCII is a strict subset of UTF-8: the bytes are already valid UTF-8.
  return str;
}

Every ASCII character has the same representation in UTF-8, so there is nothing to convert. Of course, if the input string uses an extended (8-bit) ASCII character set, the answer is more complex.
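For the 8-bit case, if the encoding happens to be ISO-8859-1 (Latin-1), whose byte values coincide with the Unicode code points U+0000..U+00FF, the conversion can be done by hand in standard C++. A sketch, assuming Latin-1 input (the helper name latin1_to_utf8 is mine); note this does not work for CP1252, whose 0x80..0x9F range differs:

#include <string>

std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                // ASCII: copy as-is
        } else {
            // Code points U+0080..U+00FF encode as two UTF-8 bytes.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}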

jalf
Can I convert an ASCII string to a Unicode string?
Eduardo
The term "Extended ASCII" has been mostly used only for CP437 (or other MS-DOS codepage), which is nearly extinct these days.
Tronic
@Tronic: True, but ultimately, any 8-bit character set that is a superset of ASCII is an extended ASCII character set. :)
jalf
@Eduardo: Which kind of Unicode? An ASCII string is already a perfectly valid UTF-8 Unicode string. Unicode defines several different encodings.
jalf
+1  A: 

ASCII is a seven-bit encoding, and the UTF-8 encoding of any character representable in ASCII is byte-for-byte identical to its ASCII encoding.

In short, there is nothing to do. Your ASCII string is already valid UTF-8.

Charles Bailey