tags:
views: 91
answers: 4

How can I find out what the current charset is in C++?

In a console application (WinXP) I am getting negative values for some characters (like äöüé) with

(int)mystring[a]

and this surprises me. I was expecting the values to be between 127 and 256.

So is there something like GetCharset() or SetCharset() in C++?

+5  A: 

It depends on how you look at the value you have at hand. char can be signed (e.g. on Windows) or unsigned, as on some other systems. So what you should do is print the value as unsigned to get what you are asking for.

C++ itself is character-set agnostic. For the Windows console specifically, you can use GetConsoleOutputCP.
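A minimal sketch of both points, assuming a Windows build (the sample string and variable names are purely illustrative):

#include <iostream>
#include <string>
#include <windows.h>   // GetConsoleOutputCP (Windows only)

int main() {
    std::string mystring = "äöüé";   // illustrative sample data

    // Reinterpret each byte as unsigned before widening to int, so the
    // extended characters print as 128..255 instead of negative numbers.
    for (std::size_t a = 0; a < mystring.size(); ++a) {
        std::cout << static_cast<int>(static_cast<unsigned char>(mystring[a])) << '\n';
    }

    // Ask the console which code page it is currently using (e.g. 850 or 1252).
    std::cout << "Console output code page: " << GetConsoleOutputCP() << '\n';
    return 0;
}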

AraK
I am taking this answer as correct because you answered the first question. The rest still remains a mystery... It is NOT about signed or unsigned int...
Stef
+1  A: 

Look at std::numeric_limits<char>::min() and max(). Or CHAR_MIN and CHAR_MAX if you don't like typing, or if you need an integer constant expression.

If CHAR_MAX == UCHAR_MAX and CHAR_MIN == 0 then chars are unsigned (as you expected). If CHAR_MAX != UCHAR_MAX and CHAR_MIN < 0 they are signed (as you're seeing).
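For example, a quick check along those lines (std::numeric_limits<char>::is_signed reports the same fact directly):

#include <climits>
#include <iostream>
#include <limits>

int main() {
    std::cout << "CHAR_MIN = " << CHAR_MIN << ", CHAR_MAX = " << CHAR_MAX << '\n';
    std::cout << "char is "
              << (std::numeric_limits<char>::is_signed ? "signed" : "unsigned")
              << " on this implementation\n";
    return 0;
}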

The standard (3.9.1/1) ensures that there are no other possibilities: "... a plain char can take on either the same values as a signed char or an unsigned char; which one is implementation-defined."

This tells you whether char is signed or unsigned, and that's what's confusing you. You certainly can't call anything to modify it: from the POV of a program it's baked into the compiler even if the compiler has ways of changing it (GCC certainly does: -fsigned-char and -funsigned-char).

The usual way to deal with this: if you're going to cast a char to int, cast it through unsigned char first. So in your example, (int)(unsigned char)mystring[a]. This ensures you get a non-negative value.

It doesn't actually tell you what charset your implementation uses for char, but I don't think you need to know that. On Microsoft compilers, the answer is essentially that commonly-used character encoding, "ISO-8859-mutter-mutter". This means that chars with 7-bit ASCII values are represented by that value, while values outside that range are ambiguous and will be interpreted by the console or other recipient according to how that recipient is configured: ISO Latin 1 unless told otherwise.

Properly speaking, the way characters are interpreted is locale-specific, and the locale can be modified and interrogated using a whole bunch of stuff towards the end of the C++ standard that personally I've never gone through and can't advise on ;-)
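For what it's worth, a minimal sketch of interrogating the locale the C way and the C++ way; which names come back is entirely platform-specific:

#include <clocale>
#include <iostream>
#include <locale>

int main() {
    // C way: passing a null pointer queries the current locale without changing it.
    std::cout << "Current C locale: " << std::setlocale(LC_ALL, nullptr) << '\n';

    // C++ way: the user's preferred locale, as reported by the environment.
    std::cout << "Environment locale: " << std::locale("").name() << '\n';
    return 0;
}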

Note that if there's a mismatch between the charset in effect and the charset your console uses, then you could be in for trouble. But I think that's separate from your issue: whether chars can be negative or not has nothing to do with charsets, just with whether char is signed.

Steve Jessop
A: 

The only guarantees that the standard provides are for members of the basic character set:

2.2 Character sets

3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific

Further, the type char is supposed to hold:

3.9.1 Fundamental types

1 Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set.

So, no guarantees that you will get the correct value for the characters you have mentioned. However, try to use an unsigned int to hold this value (for all practical purposes, it never makes sense to use a signed type to hold char values if you are going to print them or pass them around).
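To illustrate what that guarantee does (and does not) buy you, a small sketch:

#include <iostream>

int main() {
    // Guaranteed: the decimal digits '0'..'9' have consecutive, non-negative
    // values, so converting a digit character to its numeric value is portable.
    char c = '7';
    std::cout << (c - '0') << '\n';   // prints 7 on any conforming implementation

    // Not guaranteed: the value of a character such as 'ä' is
    // implementation-defined and may be negative when plain char is signed.
    return 0;
}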

dirkgently
"it never makes sense to use a signed type to hold char values ever" unfortunately, all the C standard library functions for handling characters do exactly that.
Steve Jessop
They do, but you are well-advised to use `toupper((unsigned char)c);` where `int c = getchar();` and so on ...
dirkgently
Agreed (see my answer). You have to introduce an unsigned type at some point, all I'm quibbling with is whether it should be `unsigned int` to hold the value (perfectly sensible all else being equal), or `unsigned char` as a stepping stone on the way to `int` (C-library-idiom).
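A short sketch of that idiom in context (the sample string is only illustrative):

#include <cctype>
#include <iostream>
#include <string>

int main() {
    std::string mystring = "grüße";   // illustrative only

    for (char ch : mystring) {
        // The <cctype> functions require a value representable as unsigned char
        // (or EOF); passing a negative char directly is undefined behaviour.
        std::cout << static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    }
    std::cout << '\n';
    return 0;
}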
Steve Jessop
A: 

chars are signed by default on most platforms. Try this:

cout << (int)(unsigned char) mystring[a] << endl;  // prints the byte value, 0..255
EvilTeach