I frequently work with libraries that use char when working with bytes in C++. The alternative is to define a "Byte" as unsigned char, but that is not the convention they decided to use. I frequently pass bytes from C# into the C++ DLLs and cast them to char to work with the library.

When casting ints to chars, or chars to other simple types, what are some of the side effects that can occur? Specifically, when has this broken code that you have worked on, and how did you find out it was because of char signedness?

Luckily I haven't run into this in my own code; I only used a char-signedness casting trick back in an embedded systems class in school. I'm looking to better understand the issue, since I feel it is relevant to the work I am doing.

+1  A: 

The one that most annoys me:

#include <iostream>
using namespace std;

typedef char byte;

int main()
{
    byte b = 12;
    cout << b << endl;  // prints the character with code 12 (a control character), not "12"
}

Sure it's cosmetics, but arrr...

Kornel Kisielewicz
shouldn't that be `typedef char byte`?
roe
@roe I get confused by typedef almost all the time, and I write it the same way round as Kornel did, too :P
AraK
@roe -- yeah, another thing that annoys me :>
Kornel Kisielewicz
@AraK, in my case it's because I'm from a Pascal background -- `type Byte = Char;` makes more sense :P
Kornel Kisielewicz
This must be `typedef unsigned char byte;`!!! The `char` type is *not* guaranteed to be signed/unsigned. That's why GCC has command-line options to define the behavior.
AndiDog
@AndiDog, the signedness specifier was omitted on purpose :)
Kornel Kisielewicz
@AraK: Just remember that `typedef` syntax matches variable declarations: actual type first, then the name.
jamesdlin
+1  A: 

I've been bitten by char signedness when writing search algorithms that used characters from the text as indices into state trees. I've also had it cause problems when widening characters into larger types, where the sign bit propagates and causes problems elsewhere.

I found out when I started getting bizarre results and segfaults arising from searching texts other than the ones I'd used during initial development (obviously characters with values >127 or <0 are going to cause this, and they won't necessarily be present in your typical text files).

Always check a variable's signedness when working with it. Generally I now make types unsigned unless I have a good reason otherwise, casting where necessary. This fits in nicely with the ubiquitous use of char in libraries to simply represent a byte. Keep in mind that the signedness of plain char is not defined (unlike the other integer types), so give it special treatment and be mindful.
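
A minimal sketch of how this kind of bug looks (a hypothetical lookup table and input, not the actual search code from that project); on a platform where plain char is signed, the commented-out line indexes the array with a negative value:

#include <iostream>

int main()
{
    // 256-entry lookup table, one slot per possible byte value.
    int table[256] = {};

    const char* text = "caf\xe9";   // 0xE9 is e-acute in Latin-1

    for (const char* p = text; *p != '\0'; ++p)
    {
        // Broken where char is signed: *p is -23 for the byte 0xE9,
        // so table[*p] would read before the start of the array.
        // ++table[*p];

        // Safe: convert to unsigned char before indexing.
        ++table[static_cast<unsigned char>(*p)];
    }

    std::cout << "bytes with value 0xE9: " << table[0xE9] << '\n';
}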

Matt Joiner
+4  A: 

One major risk arises if you need to shift the bytes. A signed char typically keeps the sign bit when right-shifted, whereas an unsigned char doesn't. Here's a small test program:

#include <stdio.h>

int main (void)
{
    signed char a = -1;
    unsigned char b = 255;

    printf("%d\n%d\n", a >> 1, b >> 1);

    return 0;
}

It should print -1 and 127, even though a and b start out with the same bit pattern (given 8-bit chars, two's-complement and signed values using arithmetic shift).

In short, you can't rely on shift working identically for signed and unsigned chars, so if you need portability, use unsigned char rather than char or signed char.

Vatine
Read this: http://stackoverflow.com/editing-help
avakar
You assume two's complement in addition to CHAR_BIT being 8, but right shifting a negative value is implementation-defined anyway. (An implementation can treat it the same as unsigned or different, and be following the standard either way.)
Roger Pate
Vatine
Vatine, I was referring to those `<pre>` tags. Indent text by four spaces to turn it into code block. You can use the button with ones and zeros to indent the text. You rarely need to use HTML tags on Stack Overflow. And *do* read the page I linked to.
avakar
A: 

> When casting ints to chars or chars to other simple types

The critical point is that casting a signed value from one primitive type to another (larger) type does not retain the bit pattern (assuming two's complement). A signed char with bit pattern 0xff is -1, while a signed short with the decimal value -1 is 0xffff. Casting an unsigned char with value 0xff to an unsigned short, however, yields 0x00ff. Therefore, always think about the proper signedness before you cast to a larger or smaller data type. Never carry unsigned data in signed data types if you don't need to; if an external library forces you to do so, do the conversion as late as possible (or as early as possible if the external code acts as a data source).
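
A small test program illustrating the point (a sketch assuming two's complement and 8-bit chars):

#include <cstdio>

int main()
{
    signed char   sc = -1;     // bit pattern 0xff on two's-complement machines
    unsigned char uc = 0xff;   // 255, the same bit pattern

    // Widening the signed value sign-extends: -1 becomes 0xffff as a 16-bit pattern.
    unsigned short fromSigned   = static_cast<unsigned short>(static_cast<short>(sc));

    // Widening the unsigned value zero-extends: 255 stays 0x00ff.
    unsigned short fromUnsigned = uc;

    std::printf("%04x\n%04x\n", fromSigned, fromUnsigned);   // ffff then 00ff
}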

Alexander Gessler
+1  A: 

You will fail miserably when compiling for multiple platforms because the C++ standard doesn't define char to be of a certain "signedness".

Therefore GCC provides the -fsigned-char and -funsigned-char options to force a particular behavior. More on that topic can be found here, for example.
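
A quick way to check which behavior a given build actually uses (a minimal sketch based on CHAR_MIN, not specific to GCC):

#include <climits>
#include <cstdio>

int main()
{
    // CHAR_MIN is 0 when plain char is unsigned and negative when it is signed,
    // so this reports what the current compiler and flags decided.
    std::printf("char is %s on this build\n", CHAR_MIN < 0 ? "signed" : "unsigned");
}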

EDIT:

As you asked for examples of broken code, there are plenty of possibilities to break code that processes binary data. For example, imagine you process 8-bit audio samples (range -128 to 127) and you want to halve the volume. Now imagine this scenario (in which the naive programmer assumes char == signed char):

char sampleIn;

// If the sample is -1 (= almost silent) and the compiler treats char as unsigned,
// then the value of 'sampleIn' will be 255.
read_one_byte_sample(&sampleIn);

// OK, halve the volume. With sampleIn at 255, the result will be 127!
char sampleOut = sampleIn / 2;

// And write the processed sample to the output file, for example.
// (unsigned char)127 has the exact same bit pattern as (signed char)127,
// so this will write a sample with the loudest volume!!
write_one_byte_sample_to_output_file(&sampleOut);

I hope you like that example ;-) But to be honest I've never really come across such problems, not even as a beginner, as far as I can remember...

Hope this answer is sufficient for you downvoters. What about a short comment?

AndiDog
+1  A: 

The C and C++ language specifications define 3 data types for holding characters: char, signed char and unsigned char. The latter 2 have been discussed in other answers. Let's look at the char type.

The standards say that whether the char data type is signed or unsigned is an implementation decision. This means that different compilers, or different versions of the same compiler, can implement char differently. The implication is that the char data type is not well suited to arithmetic or Boolean (comparison) operations; for those, the explicitly signed and unsigned versions of char will work fine.

In summary, there are three versions of the char data type. Plain char works well for holding characters, but it is not suited to arithmetic across platforms and translators, since its signedness is implementation-defined.
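
For what it's worth, the three types really are distinct in the C++ type system; a short sketch can confirm this (illustrative only):

#include <iostream>
#include <type_traits>

int main()
{
    // Plain char is a distinct type from both signed char and unsigned char,
    // even though its behavior must match one of them.
    std::cout << std::boolalpha
              << std::is_same<char, signed char>::value << '\n'     // false
              << std::is_same<char, unsigned char>::value << '\n';  // false

    // Which of the two it behaves like is implementation-defined.
    std::cout << (char(-1) < 0 ? "char is signed here" : "char is unsigned here") << '\n';
}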

Thomas Matthews
A: 

The most obvious gotchas come when you need to compare the numeric value of a char with a hexadecimal constant when implementing protocols or encoding schemes.

For example, when implementing telnet you might want to do this.

// Check for IAC (hex FF) byte
if (ch == 0xFF)
{
    // ...
}

Or when testing for UTF-8 multi-byte sequences.

if (ch >= 0x80)
{
    // ...
}

Fortunately, these errors don't usually survive very long, as even the most cursory testing on a platform with a signed char should reveal them. They can be fixed by using a character constant, by converting the numeric constant to a char, or by converting the character to an unsigned char before the comparison operator promotes both to an int. Converting the char directly to an unsigned int won't work, though.

if (ch == '\xff')               // OK

if ((unsigned char)ch == 0xff)  // OK, so long as char has 8 bits

if (ch == (char)0xff)           // Usually OK; relies on implementation-defined behaviour

if ((unsigned)ch == 0xff)       // still wrong

Charles Bailey
A: 

Sign extension. The first version of my URL encoding function produced strings like "%FFFFFFA3".
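
A sketch of how that happens (hypothetical code, not the original encoder): a byte >= 0x80 stored in a plain char promotes to a negative int, and %X then prints the sign-extended 32-bit value.

#include <cstdio>

int main()
{
    char c = static_cast<char>(0xA3);   // a byte >= 0x80; negative where char is signed

    // Broken: c promotes to int -93 (bit pattern 0xFFFFFFA3),
    // so this typically prints "%FFFFFFA3".
    std::printf("%%%X\n", c);

    // Fixed: convert through unsigned char so only the low 8 bits remain; prints "%A3".
    std::printf("%%%02X\n", static_cast<unsigned char>(c));
}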

dan04