Hi all!

Let's say I have a char array like "äa". Is there a way to get the ASCII value (e.g. 228) of the first char, which is multibyte? Even if I cast my array to a wchar_t * array, I'm not able to get the ASCII value of "ä", because it's 2 bytes long. Is there a way to do this? I've been trying for 2 days now :(

I'm using gcc.

Thanks!

+3  A: 
unwind
Unicode code points for characters in the ASCII range are the same as in the ASCII encoding, so with a wchar_t you can take the least significant byte and get the ASCII code. Actually, the author means something other than that; I explained it in my answer.
Andrey
A: 

What you want is called transliteration: converting letters of one language to another. It has nothing to do with Unicode and wchars. You need a mapping table.

Andrey
+2  A: 

You are very confused. ASCII only has values smaller than 128. Value 228 corresponds to ä in the 8-bit character sets ISO-8859-1, CP1252 and some others. It is also the UCS value of ä in Unicode. If you use the string literal "ä" and get a string of two chars, the string is in fact encoded in UTF-8, and you may wish to parse the UTF-8 encoding to obtain the Unicode UCS values.
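To make "parse the UTF-8 coding" concrete, here is a minimal sketch of a decoder for the first character of a buffer. The name first_code_point is made up for illustration, and the function assumes the input really is UTF-8. It rejects overlong forms but does not filter out surrogates, so treat it as a starting point and not as a validated implementation.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original post): decodes the first UTF-8
// sequence in s[0..len) and returns its code point, or -1 if it is malformed.
long first_code_point(const char* s, std::size_t len) {
    if (len == 0) return -1;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::size_t extra;  // number of continuation bytes expected
    long cp;            // payload bits collected so far
    long min;           // smallest code point allowed for this length
    if (b0 < 0x80)                { extra = 0; cp = b0;        min = 0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min = 0x10000; }
    else return -1;     // stray continuation byte or invalid lead byte
    if (len < extra + 1) return -1;  // sequence is truncated
    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return -1;  // must look like 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF) return -1;  // overlong or out of range
    return cp;
}
```

With the bytes from the question, first_code_point("\xC3\xA4" "a", 3) gives 228, the value the asker was looking for.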

More likely, what you really want to do is convert from one character set to another. How to do this depends heavily on your operating system, so more information is required. You also need to specify what exactly you want. A std::string or char* of ISO-8859-1, perhaps?
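If the target really is ISO-8859-1, the conversion from UTF-8 is simple enough to sketch by hand without any OS-specific facility. The function name utf8_to_latin1 and the '?' fallback are assumptions for illustration. Anything that does not fit in Latin 1 is replaced by the fallback, not transliterated.

```cpp
#include <string>

// Hypothetical helper: converts UTF-8 to ISO-8859-1 (Latin 1).
// Code points above U+00FF and malformed bytes become `fallback`.
std::string utf8_to_latin1(const std::string& in, char fallback = '?') {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            out += static_cast<char>(b0);  // ASCII maps to itself
            ++i;
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
            // Two-byte sequences led by C2 or C3 cover exactly U+0080..U+00FF.
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            out += static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F));
            i += 2;
        } else {
            out += fallback;  // not representable in Latin 1, or malformed
            ++i;
            // Skip the continuation bytes of the sequence we gave up on.
            while (i < in.size() &&
                   (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}
```

For general-purpose conversion between arbitrary character sets, a library routine such as iconv is the more robust choice; this sketch only covers the one pair of encodings discussed here.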

Tronic
+1  A: 

Depends on the encoding used in your char array.

If your char array is Latin 1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:

  • 0xE4 (lower-case a umlaut)
  • 0x61 (lower-case a).

Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value; it's a Latin 1 (or Unicode) value.

You would get the value like this:

int i = (unsigned char) my_array[0];
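The cast matters because plain char is signed on many platforms, gcc on x86 included. A small sketch of the difference, with made-up function names:

```cpp
// Hypothetical helper: reads one byte of a char array as a value in 0..255.
int byte_value(const char* s, int index) {
    return static_cast<unsigned char>(s[index]);
}

// Without the cast, the result depends on whether plain char is signed:
// where it is, byte 0xE4 typically comes back as -28 instead of 228.
int naive_value(const char* s, int index) {
    return s[index];
}
```

For a Latin 1 array holding "äa", byte_value(my_array, 0) is 228 regardless of the signedness of char.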

If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:

  • binary 11000011 (first byte of UTF-8 encoded 0xE4)
  • binary 10100100 (second byte of UTF-8 encoded 0xE4)
  • 0x61 (lower-case a)
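To see where those two bytes come from, here is a sketch of the encoding direction, from a code point to UTF-8 bytes. The name encode_utf8 is an assumption for illustration, and the function returns an empty string for values that cannot be encoded.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 0xE4 (binary 11100100) the top bits 00011 go into the first byte after the 110 prefix, giving 11000011, and the low six bits 100100 go into the second byte after the 10 prefix, giving 10100100.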

To recover the Unicode value of a character encoded with UTF-8, you either need to implement it yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific UTF-8-to-wchar_t conversion routine. On Linux this is mbstowcs or iconv, although for a single character you can use mbtowc, provided that the multi-byte encoding defined for the current locale is in fact UTF-8:

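// Needs <stdlib.h>. The locale must already be a UTF-8 one,
// e.g. via setlocale(LC_ALL, "") from <locale.h> at program start.
wchar_t i;
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error: the bytes are not valid in the locale's encoding
}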

If the locale's multi-byte encoding is something else, such as Shift-JIS, then this doesn't work...
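A fuller sketch of the mbtowc route, with the locale handling spelled out. The wrapper name first_wide_char is made up, and the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions: which names are installed varies between systems, so adjust the list for your platform.

```cpp
#include <clocale>
#include <cstdlib>
#include <cstring>

// Hypothetical wrapper: decodes the first multi-byte character of s with the
// C library. Returns the wchar_t value as a long, or -1 on failure.
long first_wide_char(const char* s) {
    // Assumption: one of these UTF-8 locale names exists on the system.
    if (!std::setlocale(LC_CTYPE, "C.UTF-8") &&
        !std::setlocale(LC_CTYPE, "en_US.UTF-8")) {
        return -1;  // no UTF-8 locale available
    }
    std::mbtowc(nullptr, nullptr, 0);  // reset the internal shift state
    wchar_t wc;
    const int n = std::mbtowc(&wc, s, std::strlen(s));
    // Simplification: go back to the "C" locale instead of restoring
    // whatever locale was active before the call.
    std::setlocale(LC_CTYPE, "C");
    return n > 0 ? static_cast<long>(wc) : -1;
}
```

On a platform where wchar_t holds Unicode code points, as with gcc and glibc on Linux, the UTF-8 bytes for "ä" should come back as 228. The function changes the process-wide locale, so it is not suitable for multi-threaded code as written.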

Steve Jessop
+1  A: 

There is a standard C++ function to do that conversion: ctype::narrow(), a member of the std::ctype class template. It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why ctype::narrow() takes a default character that it will return if there is no mapping.
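A minimal sketch of calling it. The wrapper name narrow_char and the '?' default are assumptions for illustration; the facet lookup itself is the standard std::use_facet call.

```cpp
#include <locale>

// Hypothetical helper: narrows one wide character using the given locale,
// returning `fallback` when the locale has no single-char equivalent.
char narrow_char(wchar_t wc, char fallback = '?',
                 const std::locale& loc = std::locale::classic()) {
    return std::use_facet<std::ctype<wchar_t> >(loc).narrow(wc, fallback);
}
```

narrow_char(L'a') gives 'a' in any locale. What you get for ä depends on the locale passed in: a Latin 1 locale can map it to the byte 0xE4, while a locale with no single-byte equivalent returns the fallback.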

shf301