
Hi guys.

I read that C does not define whether a char is signed or unsigned, and the GCC page says it can be signed on x86 and unsigned on PowerPC and ARM.

Okay, I'm writing a program with GLib, which defines char as gchar (nothing more than that, just a typedef for standardization).

My question is, what about UTF-8? Does it use more than one byte of memory per character?

Say that I have a variable

unsigned char *string = "My string with UTF-8 encoding ~> çã";

See, if I declare my variable as

signed

will I have only 127 values (so my program would have to use more bytes of memory), or do the UTF-8 values just become negative too?

Sorry if I can't explain it correctly, but I think it is a bit complex.

NOTE: Thanks for all the answers.

I still don't understand how it is normally interpreted.

I think that, as with ASCII, if I have signed and unsigned chars in my program, the strings will have different values, which leads to confusion; imagine that with UTF-8.

A: 

Not really: unsigned / signed does not specify how many values a variable can hold. It specifies how they are interpreted.

So, an unsigned char has the same number of values as a signed char, except that one includes negative numbers and the other doesn't. It is still 8 bits (assuming that a char holds 8 bits; I'm not sure it does everywhere).
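For illustration, a minimal sketch (assuming an 8-bit char on a two's-complement machine, which covers virtually all current platforms) showing the same bit pattern read both ways:

#include <stdio.h>

int main(void)
{
    unsigned char u = 0xE7;             /* a byte with the high bit set */
    signed char s = (signed char) u;    /* same bits, implementation-defined value */

    /* Both hold the bit pattern 1110 0111; only its interpretation differs. */
    printf("as unsigned: %d\n", u);     /* 231 */
    printf("as signed:   %d\n", s);     /* -25 on two's-complement machines */
    return 0;
}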

Daren Thomas
Standard C guarantees that a character holds at least 8 bits; there are few 9-bit or 10-bit character machines left.
Jonathan Leffler
+2  A: 

signed / unsigned affects only arithmetic operations (including comparisons). If char is unsigned, the high values will be positive; if signed, they will be negative. But the number of representable values is still the same.
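A small sketch of the arithmetic difference (again assuming an 8-bit two's-complement char):

#include <stdio.h>

int main(void)
{
    signed char s = (signed char) 0xF0;  /* -16 on two's-complement machines */
    unsigned char u = 0xF0;              /* 240 */

    /* Identical bit patterns, but arithmetic treats them differently. */
    printf("s / 2 = %d\n", s / 2);       /* -8  */
    printf("u / 2 = %d\n", u / 2);       /* 120 */
    return 0;
}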

Andrey
+1  A: 

It makes no difference when using a char* as a string. The only time signed/unsigned would make a difference is if you were interpreting it as a number, e.g. for arithmetic, or if you were to print it as an integer.

Graphics Noob
It can also make a difference if you're comparing characters. For example, in the UTF8 case, the 'flag' characters will generally be negative if `char` is signed. If your code isn't prepared for that, things will break.
Michael Burr
Can you explain it a little more?
drigoSkalWalker
@Michael Burr I didn't know that, do you have a reference?
Graphics Noob
re: negative flag chars, this would only be the case if you're actually writing the UTF8 en/decoder yourself. If that's a black box then a bunch of bytes is all you know about what goes in/out.
quixoto
@Graphics and drigoSkalWalker: I've expanded on my comment here: http://stackoverflow.com/questions/2524226/char-c-question-about-encoding-signed-unsigned/2525010#2525010
Michael Burr
Michael's point is that you can't rely on `highValuedCharacter > lowValuedCharacter`, since high values wrap around to negative with signed chars. As an obvious example, you can't check whether a UTF-8 byte is non-ASCII by checking for `> 127`, because of course a signed char has no values in that range.
Chuck
A: 

UTF-8 characters cannot be assumed to fit in one byte. A UTF-8 character can be 1-4 bytes wide, so no char or wchar_t, signed or unsigned, is sufficient if you assume that one unit always stores one UTF-8 character.

Most platforms (such as PHP, .NET, etc.) have you build strings normally (such as char[] in C) and you use a library to convert between encodings and parse characters out of the string.
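Since the question mentions GLib, here is a minimal sketch of the bytes-versus-characters distinction using GLib's g_utf8_strlen (a real GLib function; building with pkg-config's glib-2.0 flags is assumed):

#include <glib.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "çã" is 2 characters but 4 bytes in UTF-8; hex escapes keep the
       source file encoding out of the picture. */
    const gchar *s = "\xC3\xA7\xC3\xA3";

    printf("bytes: %lu\n", (unsigned long) strlen(s));   /* 4: strlen counts bytes */
    printf("chars: %ld\n", (long) g_utf8_strlen(s, -1)); /* 2: counts code points */
    return 0;
}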

spoulson
Yes, it's obvious that I need an ARRAY of chars, but my question is about signed and unsigned: if I have a signed or unsigned ARRAY of chars, can that make my program run wrong?
drigoSkalWalker
The other answers are correct in saying that signed/unsigned does not change the size of the data being stored. My concern was just that UTF-8 characters from Kanji, Arabic, etc. can take more than one byte.
spoulson
+1  A: 

I've had a couple requests to explain a comment I made.

The fact that a char type can default to either a signed or an unsigned type can be significant when you're comparing characters and expect a certain ordering. In particular, UTF-8 uses the high bit (assuming that char is an 8-bit type, which is true on the vast majority of platforms) to indicate that a character code point requires more than one byte to be represented.

A quick and dirty example of the problem:

#include <stdio.h>

int main(void)
{
    signed char flag = 0xf0;     /* implementation-defined conversion: -16 on two's-complement machines */
    unsigned char uflag = 0xf0;  /* 240 */

    if (flag < (signed char) 'z') {
        printf("flag is smaller than 'z'\n");
    }
    else {
        printf("flag is larger than 'z'\n");
    }

    if (uflag < (unsigned char) 'z') {
        printf("uflag is smaller than 'z'\n");
    }
    else {
        printf("uflag is larger than 'z'\n");
    }
    return 0;
}

On most projects that I work on, the unadorned char type is typically avoided in favor of a typedef that explicitly specifies an unsigned type. Something like uint8_t from stdint.h or

typedef unsigned char u8;

Generally, dealing with an unsigned char type seems to work well and cause few problems. The one area where I have seen occasional problems is using a variable of that type to control a loop:

while (uchar_var-- >= 0) {
    // infinite loop: an unsigned value is always >= 0, so the test never fails
}
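One common fix (a minimal sketch) is to test with > instead of >=, so the condition can eventually fail; note that the post-decrement still wraps the variable to 255 after the final test:

while (uchar_var-- > 0) {
    // runs once for each value from the initial count down to 1
}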
Michael Burr
A: 

Two things:

  1. Whether a char type is signed or unsigned won't affect your ability to translate UTF-8 encoded strings to and from whatever display string type you're using (WCHAR or whatnot). Don't worry about it, in other words: the UTF-8 bytes are just bytes, and whatever you're using as an encoder/decoder will do the right thing.

  2. Some of your confusion may be that you're trying to do this:

    unsigned char *string = "This is a UTF8 string";
    

    Don't do this; you're mixing different concepts. A UTF-8 encoded string is just a sequence of bytes. C string literals (as above) were not really designed to represent this; they're designed to represent "ASCII-encoded" strings. Although in some cases (like mine here) they end up being the same thing, in your example in the question they may not, and in other cases they certainly won't be. Load your Unicode strings from an external resource. In general I'd be wary of embedding non-ASCII characters in a .c source file; even if the compiler knows what to do with them, other software in your toolchain may not. (One workaround is sketched below.)
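If you do need a few non-ASCII characters in source, one portable workaround (a sketch) is to spell out the UTF-8 bytes with hexadecimal escapes, so the encoding of the source file no longer matters:

/* "çã" written as explicit UTF-8 bytes (U+00E7 = C3 A7, U+00E3 = C3 A3) */
static const char utf8_bytes[] = "\xC3\xA7\xC3\xA3";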

quixoto
A: 

Using unsigned char has its pros and cons. The biggest benefits are that you don't get sign extension or other funny features such as signed overflow that would produce unexpected results from calculations. unsigned char is also compatible with the <ctype.h> (<cctype> in C++) macros/functions such as isalpha(ch), all of which require values in unsigned char range (or EOF). On the other hand, all the I/O functions require char*, so you have to cast whenever you do I/O.
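For example, a minimal sketch of that ctype pitfall in C (the cast is required because isalpha expects an unsigned char value or EOF):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char c = (char) 0xE7;   /* negative if char is signed */

    /* isalpha(c) would be undefined behavior when c is negative;
       converting to unsigned char first is the safe idiom. */
    printf("%d\n", isalpha((unsigned char) c));   /* 0 in the default C locale */
    return 0;
}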

As for UTF-8, storing it in signed or unsigned arrays is fine but you have to be careful with those string literals as there is little guarantee about them being valid UTF-8. C++0x adds UTF-8 string literals to avoid possible issues and I would expect the next C standard to adopt those as well.
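For reference, the C++0x syntax looks like this (a sketch; C11 did later adopt the same u8 prefix):

// The u8 prefix guarantees the literal is stored as UTF-8 in the compiled
// program, regardless of the execution character set.
const char *s = u8"çã";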

In general you should be fine, though, as long as you make sure that your source code files are always UTF-8 encoded.

Tronic