
Hi guys.

I read that C does not define whether a char is signed or unsigned, and the GCC page says it can be signed on x86 and unsigned on PowerPC and ARM.

Okay, I'm writing a program with GLib, which defines char as gchar (nothing more than that, just a typedef for standardization).

My question is, what about UTF-8? Does it use more than one byte of memory per character?

Say that I have a variable

unsigned char *string = "My string with UTF-8 encoding ~> çã";

See, if I declare my variable as

signed

will I have only 127 values (so my program would have to use more bytes of memory), or do the UTF-8 values just become negative too?

Sorry if I can't explain it correctly, but I think it is a bit complex.

NOTE: Thanks for all the answers.

I still don't understand how it is normally interpreted.

I think that, as with ASCII, if I have signed and unsigned chars in my program, the strings will have different values, which leads to confusion; imagine that with UTF-8.

A: 

Not really: unsigned / signed does not specify how many values a variable can hold. It specifies how they are interpreted.

So, an unsigned char has the same number of values as a signed char, except that one includes negative numbers and the other doesn't. It is still 8 bits (assuming that a char holds 8 bits; I'm not sure it does everywhere).
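For illustration, a minimal sketch (assuming an 8-bit char on a two's-complement machine, which covers virtually all current platforms) showing the same bit pattern read both ways:

#include <stdio.h>

int main(void)
{
    unsigned char u = 0xE7;             /* a byte with the high bit set */
    signed char s = (signed char) u;    /* same bits, implementation-defined value */

    /* Both hold the bit pattern 1110 0111; only its interpretation differs. */
    printf("as unsigned: %d\n", u);     /* 231 */
    printf("as signed:   %d\n", s);     /* -25 on two's-complement machines */
    return 0;
}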

Daren Thomas
Standard C guarantees that a character holds at least 8 bits; there are few 9-bit or 10-bit character machines left.
Jonathan Leffler
+2  A: 

signed / unsigned affects only arithmetic operations (including comparisons). If char is unsigned, the high values will be positive; if signed, they will be negative. But the number of representable values is still the same.
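A small sketch of the arithmetic difference (again assuming an 8-bit two's-complement char):

#include <stdio.h>

int main(void)
{
    signed char s = (signed char) 0xF0;  /* -16 on two's-complement machines */
    unsigned char u = 0xF0;              /* 240 */

    /* Identical bit patterns, but arithmetic treats them differently. */
    printf("s / 2 = %d\n", s / 2);       /* -8  */
    printf("u / 2 = %d\n", u / 2);       /* 120 */
    return 0;
}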

Andrey
+1  A: 

It makes no difference when using a char* as a string. The only time signed/unsigned would make a difference is if you were interpreting it as a number, e.g. for arithmetic, or if you were to print it as an integer.

Graphics Noob
It can also make a difference if you're comparing characters. For example, in the UTF8 case, the 'flag' characters will generally be negative if `char` is signed. If your code isn't prepared for that, things will break.
Michael Burr
Can you explain it a little more?
drigoSkalWalker
@Michael Burr I didn't know that, do you have a reference?
Graphics Noob
re: negative flag chars, this would only be the case if you're actually writing the UTF8 en/decoder yourself. If that's a black box then a bunch of bytes is all you know about what goes in/out.
quixoto
@Graphics and drigoSkalWalker: I've expanded on my comment here: http://stackoverflow.com/questions/2524226/char-c-question-about-encoding-signed-unsigned/2525010#2525010
Michael Burr
Michael's point is that you can't rely on `highValuedCharacter > lowValuedCharacter`, since high values wrap around to negative with signed chars. As an obvious example, you can't check whether a UTF-8 byte is non-ASCII by checking for `> 127`, because of course a signed char has no values in that range.
Chuck
A: 

UTF-8 characters cannot be assumed to fit in one byte. A UTF-8 character can be 1-4 bytes wide, so no char or wchar_t, signed or unsigned, is sufficient if you assume that one unit always stores one UTF-8 character.

Most platforms (such as PHP, .NET, etc.) have you build strings normally (such as char[] in C) and you use a library to convert between encodings and parse characters out of the string.
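Since the question mentions GLib, here is a minimal sketch of the bytes-versus-characters distinction using GLib's g_utf8_strlen (a real GLib function; building with pkg-config's glib-2.0 flags is assumed):

#include <glib.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "çã" is 2 characters but 4 bytes in UTF-8; hex escapes keep the
       source file encoding out of the picture. */
    const gchar *s = "\xC3\xA7\xC3\xA3";

    printf("bytes: %lu\n", (unsigned long) strlen(s));   /* 4: strlen counts bytes */
    printf("chars: %ld\n", (long) g_utf8_strlen(s, -1)); /* 2: counts code points */
    return 0;
}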

spoulson
Yes, it's obvious that I need an ARRAY of chars, but my question is about signed and unsigned: if I have a signed or unsigned ARRAY of chars, can that make my program run wrong?
drigoSkalWalker
The other answers are correct in saying that signed/unsigned does not change the size of the data being stored. My concern was just that UTF-8 characters from Kanji, Arabic, etc. can take more than one byte.
spoulson
+1  A: 

I've had a couple requests to explain a comment I made.

The fact that a char type can default to either a signed or an unsigned type can be significant when you're comparing characters and expect a certain ordering. In particular, UTF-8 uses the high bit (assuming that char is an 8-bit type, which is true on the vast majority of platforms) to indicate that a character code point requires more than one byte to be represented.

A quick and dirty example of the problem:

#include <stdio.h>

int main(void)
{
    signed char flag = 0xf0;     /* implementation-defined conversion: -16 on two's-complement machines */
    unsigned char uflag = 0xf0;  /* 240 */

    if (flag < (signed char) 'z') {
        printf("flag is smaller than 'z'\n");
    }
    else {
        printf("flag is larger than 'z'\n");
    }

    if (uflag < (unsigned char) 'z') {
        printf("uflag is smaller than 'z'\n");
    }
    else {
        printf("uflag is larger than 'z'\n");
    }
    return 0;
}

On most projects that I work on, the unadorned char type is typically avoided in favor of a typedef that explicitly specifies an unsigned type. Something like uint8_t from stdint.h or

typedef unsigned char u8;

Generally, dealing with an unsigned char type seems to work well and cause few problems. The one area where I have seen occasional problems is using a variable of that type to control a loop:

while (uchar_var-- >= 0) {
    // infinite loop: an unsigned value is always >= 0, so the test never fails
}
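One common fix (a minimal sketch) is to test with > instead of >=, so the condition can eventually fail; note that the post-decrement still wraps the variable to 255 after the final test:

while (uchar_var-- > 0) {
    // runs once for each value from the initial count down to 1
}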
Michael Burr
A: 

Two things:

  1. Whether a char type is signed or unsigned won't affect your ability to translate UTF-8 encoded strings to and from whatever display string type you're using (WCHAR or whatnot). Don't worry about it, in other words: the UTF-8 bytes are just bytes, and whatever you're using as an encoder/decoder will do the right thing.

  2. Some of your confusion may be that you're trying to do this:

    unsigned char *string = "This is a UTF8 string";
    

    Don't do this; you're mixing different concepts. A UTF-8 encoded string is just a sequence of bytes. C string literals (as above) were not really designed to represent this; they're designed to represent "ASCII-encoded" strings. Although in some cases (like mine here) they end up being the same thing, in your example in the question they may not, and in other cases they certainly won't be. Load your Unicode strings from an external resource. In general I'd be wary of embedding non-ASCII characters in a .c source file; even if the compiler knows what to do with them, other software in your toolchain may not. (One workaround is sketched below.)
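If you do need a few non-ASCII characters in source, one portable workaround (a sketch) is to spell out the UTF-8 bytes with hexadecimal escapes, so the encoding of the source file no longer matters:

/* "çã" written as explicit UTF-8 bytes (U+00E7 = C3 A7, U+00E3 = C3 A3) */
static const char utf8_bytes[] = "\xC3\xA7\xC3\xA3";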

quixoto
A: 

Using unsigned char has its pros and cons. The biggest benefits are that you don't get sign extension or other funny features such as signed overflow that would produce unexpected results from calculations. unsigned char is also compatible with the <ctype.h> (<cctype> in C++) macros/functions such as isalpha(ch), all of which require values in unsigned char range (or EOF). On the other hand, all the I/O functions require char*, so you have to cast whenever you do I/O.
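For example, a minimal sketch of that ctype pitfall in C (the cast is required because isalpha expects an unsigned char value or EOF):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char c = (char) 0xE7;   /* negative if char is signed */

    /* isalpha(c) would be undefined behavior when c is negative;
       converting to unsigned char first is the safe idiom. */
    printf("%d\n", isalpha((unsigned char) c));   /* 0 in the default C locale */
    return 0;
}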

As for UTF-8, storing it in signed or unsigned arrays is fine but you have to be careful with those string literals as there is little guarantee about them being valid UTF-8. C++0x adds UTF-8 string literals to avoid possible issues and I would expect the next C standard to adopt those as well.
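For reference, the C++0x syntax looks like this (a sketch; C11 did later adopt the same u8 prefix):

// The u8 prefix guarantees the literal is stored as UTF-8 in the compiled
// program, regardless of the execution character set.
const char *s = u8"çã";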

In general you should be fine, though, as long as you make sure that your source code files are always UTF-8 encoded.

Tronic