A friend of mine showed me a situation where reading characters produced unexpected behaviour. Reading the character '¤' caused his program to crash. I worked out that '¤' is 164 decimal, which is outside the ASCII range (0-127).

We noticed the behaviour on '¤' but any character >127 seems to show the problem. The question is how would we reliably read such characters char by char?

#include <iostream>
#include <iomanip>
using namespace std;

int main(int argc, const char *argv[])
{
    char input;
    do
    {
        cin >> input;
        cout << input;
        cout << " " << setbase(10) << (int)input;
        cout << " 0x" << setbase(16) << (int)input;

        cout << endl;
    } while(input);
    return 0;
}


masse@libre:temp/2009-11-30 $ ./a.out 
¤
 -62 0xffffffc2
¤ -92 0xffffffa4
+1  A: 

It is hard to tell why your friend's program is crashing without seeing the code, but it could be because the char is being used as an index into an array. Where plain char is signed, as it is on most platforms, any byte above 127 becomes a negative value, and a negative index reads outside the array.
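As a minimal sketch of that failure mode and one way around it (the function name count_bytes is made up for illustration and is not taken from the friend's program), casting each char to unsigned char before indexing keeps the index in 0..255:

```cpp
#include <array>
#include <string>

// Hypothetical helper: count how often each byte value occurs in `text`.
// Casting to unsigned char first maps every byte to 0..255, so the index
// is never negative, even on platforms where plain char is signed.
std::array<int, 256> count_bytes(const std::string& text)
{
    std::array<int, 256> counts{};  // all counters start at zero
    for (char c : text) {
        counts[static_cast<unsigned char>(c)]++;
    }
    return counts;
}
```

Without the cast, a byte such as 0xA4 would index the array at -92 on a signed-char platform, which is undefined behaviour and a plausible cause of a crash.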

A. Levy
+1  A: 

Declare 'input' as unsigned char instead.

Anders K.
I get nearly the same behaviour: "Â 194 0xc2" and "¤ 164 0xa4". I still get two prints, although the second one is correct.
Masse
I seem to have missed the UTF-8 tag in your posting. Sorry.
Anders K.
+2  A: 

Your system is using UTF-8 character encoding (as it should), so the character '¤' causes your program to read the byte sequence C2 A4. Since a char is one byte, cin reads them one at a time. Look into wchar_t and the corresponding wcin and wcout streams to read multibyte characters, although I don't know which encodings they support or how they interact with locales.
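If you would rather stay with plain char, the bytes can be regrouped by hand, because the first byte of a UTF-8 sequence encodes how many bytes follow. A rough sketch, assuming the input really is UTF-8 (the names utf8_length and split_utf8 are made up for this example, and nothing here validates the continuation bytes):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// A byte that cannot start a sequence is treated as a unit of its own.
std::size_t utf8_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// Split a UTF-8 string into one std::string per encoded character.
// The only safety check is clamping a truncated final sequence.
std::vector<std::string> split_utf8(const std::string& bytes)
{
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < bytes.size()) {
        std::size_t n = utf8_length(static_cast<unsigned char>(bytes[i]));
        if (i + n > bytes.size()) n = bytes.size() - i;  // truncated input
        chars.push_back(bytes.substr(i, n));
        i += n;
    }
    return chars;
}
```

For the bytes C2 A4, utf8_length(0xC2) is 2, so both bytes end up in the same element and '¤' stays in one piece.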

Also, your program is outputting invalid UTF-8, because each byte is written on its own with other text in between, so you really shouldn't be seeing those two characters; I get question marks on my system.

(This is a nitpick and somewhat off-topic, but your while(input) should be while(cin). Otherwise, once cin reaches end of file the extraction fails, input keeps its last value, and you get an infinite loop.)

jleedev
With wchar_t, wcin and wcout I get an infinite loop when handling Unicode characters.
Masse
If you didn't change your `while(input)`, you're going to get an infinite loop anyway.
jleedev
Yep, fixed it. However, even with while(input) the characters were read fine; it only went into an infinite loop when I sent EOF to the program. With wchar_t I got an infinite loop with every non-ASCII character.
Masse