After reading this question on signed/unsigned compares (they come up every couple of days, I'd say), I wondered: why don't we have proper signed/unsigned comparisons, instead of this horrible mess? Take the output of this small program:

#include <stdio.h>

#define C(T1,T2) \
  { signed   T1 a = -1; \
    unsigned T2 b = 1;  \
    printf("(signed %5s)%d < (unsigned %5s)%d = %d\n", \
           #T1, (int)a, #T2, (int)b, (a < b)); }

#define C1(T) printf("%s:%d\n", #T, (int)sizeof(T)); \
  C(T,char); C(T,short); C(T,int); C(T,long);

int main(void)
{
  C1(char); C1(short); C1(int); C1(long);
}

Compiled with my standard compiler (gcc, 64-bit), I get this:

char:1
(signed  char)-1 < (unsigned  char)1 = 1
(signed  char)-1 < (unsigned short)1 = 1
(signed  char)-1 < (unsigned   int)1 = 0
(signed  char)-1 < (unsigned  long)1 = 0
short:2
(signed short)-1 < (unsigned  char)1 = 1
(signed short)-1 < (unsigned short)1 = 1
(signed short)-1 < (unsigned   int)1 = 0
(signed short)-1 < (unsigned  long)1 = 0
int:4
(signed   int)-1 < (unsigned  char)1 = 1
(signed   int)-1 < (unsigned short)1 = 1
(signed   int)-1 < (unsigned   int)1 = 0
(signed   int)-1 < (unsigned  long)1 = 0
long:8
(signed  long)-1 < (unsigned  char)1 = 1
(signed  long)-1 < (unsigned short)1 = 1
(signed  long)-1 < (unsigned   int)1 = 1
(signed  long)-1 < (unsigned  long)1 = 0

If I compile for 32-bit, the result is the same except that:

long:4
(signed  long)-1 < (unsigned   int)1 = 0

The "How?" of all this is easy to find: Just goto section 6.3 of the C99 standard or chapter 4 of C++ and dig up the clauses which describe how the operands are converted to a common type and this can break if the common type reinterprets negative values.

But what about the "Why?". As we can see, '<' gives the mathematically wrong answer in half of all cases, and the outcome depends on the concrete sizes of the types, so it is platform-dependent. Here are some points to consider:

  • The convert-and-compare process is not exactly a prime example of the Rule of Least Surprise.

  • I don't believe there is code out there which relies on the proposition that (short)-1 > (unsigned)1 and was not written by terrorists.

  • This is all terrible when you're writing C++ template code, because you need type-trait magic to knit a correct "<" (a sketch follows below).


After all, correctly comparing signed and unsigned values of different types is easy to implement:

signed X < unsigned Y  ->  (a < (X)0) || ((Z)a < (Z)b),  where Z = X|Y (the wider of the two types, taken unsigned)

The pre-check is cheap, and the compiler can also optimize it away whenever a >= 0 can be proven statically.
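
A minimal sketch of such a compare in C++, using the C++11 <type_traits> (Boost/TR1 offered equivalents at the time); the name safe_less and the trait plumbing are my own, not part of any standard:

#include <type_traits>

// Hypothetical helper implementing the formula above: a negative signed
// value is smaller than any unsigned value; otherwise both operands fit
// into the wider unsigned type Z, where they compare correctly.
template<class S, class U>
bool safe_less(S a, U b)
{
    static_assert(std::is_signed<S>::value && std::is_unsigned<U>::value,
                  "expects a signed left and an unsigned right operand");
    typedef typename std::conditional<
        (sizeof(S) >= sizeof(U)),
        typename std::make_unsigned<S>::type,  // Z = unsigned X if X is wider
        U                                      // Z = Y otherwise
    >::type Z;
    return a < (S)0 || (Z)a < (Z)b;
}

With this, safe_less((short)-1, 1u) is true on every platform, and the pre-check disappears wherever the compiler can see that a is non-negative.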

So here's my question:

Would it break the language or existing code if we added safe signed/unsigned compares to C/C++?

("Would it break the language" means would we need to make massive changes to different parts of the language to accommodate this change)


UPDATE: I ran this on my good old Turbo-C++ 3.0 and got this output:

char:1
(signed  char)-1 < (unsigned  char)1 = 0

Why is (signed char)-1 < (unsigned char)1 == 0 here?

+7  A: 

I do not think it would break the language, but yes, it could break some existing code (and the breakage would probably be hard to detect at the compiler level).

There exists a lot more code written in C and C++ than you and I together can imagine (some of it may be even written by terrorists).

Relying on the "proposition that (short)-1 > (unsigned)1" may be done unintentionally by someone. There exists a lot of C code dealing with complex bit manipulation and similar things, and it is quite possible that some programmer is using the current comparison behaviour in such code. (Other people have already provided nice examples of such code, and the code is even simpler than I would have expected.)

The current solution is to warn on such comparisons instead and leave the decision to the programmer, which I think is in the spirit of how C and C++ work. Also, solving it at the compiler level would incur a performance penalty, and that is something C and C++ programmers are extremely sensitive to. Two tests instead of one might seem like a minor issue to you, but there is probably plenty of C code where it would matter. The previous behaviour could be forced with explicit casts to a common data type, but that again requires the programmer's attention, so it is no better than a simple warning.
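
For illustration, forcing today's behaviour with such a cast might look like this (a sketch with variables of my own choosing):

#include <stdio.h>

int main(void)
{
    int a = -1;
    unsigned b = 1;
    /* The explicit cast documents the intent and keeps the current
       wrap-around compare: a is deliberately reinterpreted as unsigned. */
    if ((unsigned)a < b)
        puts("wrapped compare: a < b");
    else
        puts("wrapped compare: a >= b");  /* taken: (unsigned)-1 is huge */
    return 0;
}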

Suma
There's no reason this couldn't be added to a later spec, with old code continuing to compile as C89 or C99.
Peter Gibson
Sure there is - the language is not supposed to gratuitously break things from version to version. What do you do when you're using a library that expects one behavior with an application that expects the other? (Especially if part of the library code is macros or inline functions in headers?) Sure there will sometimes be minor breakage, but something as fundamental as this should never be touched!
R..
+6  A: 

Yes, it would break the language/existing code. The language, as you have noted, carefully specifies the behavior when signed and unsigned operands are used together. This behavior with comparison operators is essential for some important idioms, like:

if (x-'0' < 10U)
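
(For readers unfamiliar with the idiom, the wrap-around folds a two-sided range check into a single compare; a small illustration of mine:)

#include <stdio.h>

int main(void)
{
    /* For a digit, x-'0' is 0..9 and the unsigned compare passes. For a
       character below '0', x-'0' is negative and converts to a huge
       unsigned value, so the same single compare rejects it -- no
       (x >= '0' && x <= '9') pair needed. */
    printf("%d\n", '7' - '0' < 10U);  /* 1: '7' is a digit */
    printf("%d\n", '!' - '0' < 10U);  /* 0: negative difference wraps */
    return 0;
}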

Not to mention things like (equality comparison):

size_t l = mbrtowc(&wc, s, n, &state);
if (l==-1) ... /* Note that mbrtowc returns (size_t)-1 on failure */

As an aside, specifying "natural" behavior for mixed signed/unsigned comparisons would also incur a significant performance penalty, even in programs that presently use such comparisons safely, i.e. where they already have their "natural" behavior due to constraints on the input that the compiler would have a hard time determining (or might not be able to determine at all). Having written your own code to handle these tests, I'm sure you've already seen what the penalty would look like, and it's not pretty.

R..
Another common example: `if (snprintf(buf, sizeof buf, ...) >= sizeof buf)` - this catches both errors (return value of -1) and overflows with a single comparison, due to the fact that `size_t` is unsigned.
R..
+1: Good examples. I'm a bit blind in that spot because I usually avoid code which assumes an exploitable representation of ints < 0 in unsigned.
Luther Blissett
@Luther: converting a negative signed value to an unsigned type is well-defined, and doesn't depend on the representation.
Mike Seymour
@Michael: The value of (unsigned)(-1) depends on the size of unsigned. Btw, the method given in 6.3.1.3 seems to suggest that (unsigned)-1 has the same bit pattern as -1 in two's-complement arithmetic, doesn't it?
Luther Blissett
@R: I'm not sure if `size_t x ... if (x==-1)` is actually correct. C99 says about `size_t` that it has to be *an unsigned integer type* and should not have *an integer conversion rank greater than signed long*. This seems to suggest that size_t could in fact be, for example, a 16-bit `unsigned short` alongside a 32-bit `int`. In that case `(x==-1)` would fail, wouldn't it?
Luther Blissett
@Luther: I suppose it would, but there has never been, and will never be, a C implementation where `size_t` is smaller than `int`. While the standard doesn't forbid it, it's completely ridiculous.
R..
@Luther: `unsigned` is the same size as `int`, so `(unsigned)(-1)` will be `UINT_MAX`. And yes, the conversion is equivalent to reinterpreting the two's-complement bit pattern. You're right about `size_t`; if it is smaller than `int`, then it would be converted to a positive-valued `int` for the comparison.
Mike Seymour
@R..: As far as I can tell, a freestanding C89 implementation for a microcontroller could make `size_t` an 8-bit `unsigned char` (the only minimum requirement I can find is §2.2.4.1 “32767 bytes in an object (in a hosted environment only)”). In C99, `SIZE_MAX>=65535`, so `size_t` shorter than `unsigned int` would be pathological.
Gilles
@Gilles: it doesn't seem pathological to me to have a 32-bit `int` and a 16-bit `size_t`. I think there have been processors with 32-bit data registers and a 16-bit address bus, where those would be the natural sizes.
Mike Seymour
+8  A: 

My answer is for C only.

There is no type in C that can accommodate all possible values of all possible integer types. The closest C99 comes to this is intmax_t and uintmax_t, and the intersection of their ranges covers only half of each: only [0, INTMAX_MAX] is representable in both.

Therefore, you cannot implement a mathematical value comparison such as x <= y by first converting x and y to a common type and then doing a simple operation. This is a major departure from the general principle of how operators work, and it breaks the intuition that operators correspond to things that tend to be single instructions on common hardware.

Even if you added this complexity to the language (and the extra burden on implementation writers), it wouldn't have very nice properties. For example, x <= y would still not be equivalent to x - y <= 0. If you wanted all these nice properties, you'd have to make arbitrary-sized integers part of the language.
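
A concrete illustration of that broken equivalence (my sketch; assumes 32-bit unsigned, and the proposed mathematical comparison):

#include <stdio.h>

int main(void)
{
    int x = -1;
    unsigned y = 1;
    /* A mathematical <= would call x <= y true (-1 <= 1). But x - y is
       still computed in unsigned arithmetic and wraps to UINT_MAX - 1,
       so (x - y <= 0) remains false -- the equivalence cannot hold. */
    printf("%u\n", x - y);         /* 4294967294 with 32-bit unsigned */
    printf("%d\n", x - y <= 0u);   /* 0 */
    return 0;
}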

I'm sure there's plenty of old Unix code out there, possibly some running on your machine, that assumes that (int)-1 > (unsigned)1. (OK, maybe it was written by freedom fighters ;-)

If you want lisp/haskell/python/$favorite_language_with_bignums_built_in, you know where to find it...

Gilles
+1 for bringing up the still-missing equivalence between `x<=y` and `x-y<=0`.
R..
Hm. I would have to make the type of x-y signed in all cases where either x or y is signed if I want to preserve this. I think this counts as 'breaking the language'.
Luther Blissett
@Luther: you have to make the type of `x-y` big enough to accommodate `UINTMAX_MAX - INTMAX_MIN`, which means it must be one bit bigger than (`u`)`intmax_t`. Therefore, there cannot be a biggest integer type; in other words, the language must have bignums built in. I wouldn't call the resulting language C.
Gilles
Why? `int1+int2` can't hold `INTMAX_MAX+INTMAX_MAX` either. And almost everyone's happy with ints *that one can't even negate without invoking UB*. So we could either decide to let x-y under- or overflow into UB (trap, wrap-around) or specify under/overflow behavior that actually makes sense.
Luther Blissett
@Luther: limited-size types, mathematically-intuitive `x<=y`, `x<=y` equivalent to `x-y<=0`: pick two. Your proposal keeps 1 and adds 2 to the language, but I don't see much point in 2 without 3.
Gilles
I'm not convinced that *"intuitive equivalences should hold"* really makes a point in support of C's way of handling signed/unsigned. For example, `-1<1u` is `0`, but `0<1u+1` is `1` obviously, while `-1-1u<0` is `0` again.
Luther Blissett
As far as the mathematics go, the problem is that naive coders expect `<` to be a mathematical order relation, while the ring they're working in is not the integers but **integers mod 2^n**. There's simply no way to have an order relation that fits the usual algebraic rules (e.g. `a<b => a+c<b+c`) on this ring, so you have to deal with the fact that the `<` relation is actually carving out an interval. You can either whine about this and beg people to "fix" the standard to meet impossible criteria, or you can use it to your advantage to write efficient code. :-)
R..
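
(A concrete instance of the failing rule, as a sketch of mine:)

#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned a = 0, b = 1, c = UINT_MAX;
    /* In the ring mod 2^n, a < b does not imply a + c < b + c:
       a + c wraps to UINT_MAX while b + c wraps to 0. */
    printf("%d\n", a < b);          /* 1 */
    printf("%d\n", a + c < b + c);  /* 0 */
    return 0;
}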
*"As far as the mathematics go, the problem is that naive coders expect < to be a mathematical order relation, while the ring they're working in is not the integers but integers mod 2^n."* -- We're with mod 2^n only as long as we stay within unsigned numbers. int *has* this order relation (barred UB). What I'm asking is, if it is really necessary to have a completely inconsistent order (which depends on the size of the operands) when it comes to **mixed** sign arithmetic.
Luther Blissett
+1  A: 

I think C++ is like the Roman Empire: it's big, and too established to fix the things that are going to destroy it.

C++0x and Boost are examples of horrible, horrible syntax, the kind of baby only its parents could love, and are a long, long way from the simple, elegant (but severely limited) C++ of 10 years ago.

The point is, by the time one has "fixed" something as terribly simple as comparisons of integral types, enough legacy and existing C++ code will have been broken that one might as well just call it a new language.

And once broken, there is so much else that is also eligible for retroactive fixing.

Chris Becke
The beast is not my kid, and yet I like it... What parts of C++0x do you deem *horrible, horrible*? There are some that are awkward... but all languages have that sort of thing (just name a language).
David Rodríguez - dribeas
I refer not to any of the features of C++0x, which are all fantastic and sorely needed. I refer to the fact that C++0x and Boost (and the STL) are implemented on top of a horrid template (and namespace) syntax. As a simple question: why is `::` used as the scope resolution operator? There are virtually no situations where a `.` would cause ambiguity. Or, instead of templates, why didn't C++ just have types as first-class variables (or give them first-class variable usage)? I.e., replace `Array<int> theArray;` with `Array theArray(int);`.
Chris Becke
I agree that templates could be a lot cleaner, but I don't see why a `::` operator is particularly horrible. In general, I don't really care what individual syntactic tokens look like. Surely that's not what makes the language ugly.
jalf
`::` is horrible, as is `->`. If the array decomposition rule had been extended to classes, then all class variables would have decomposed to implicit class references, meaning that `.` could have been used as a universal dereferencing / scope resolution operator. Doing more with less, extended too far, can result in languages that look like Lisp. But generally, being able to say `a.b.c.d.e` without caring whether a ... e is a namespace, class name, class instance, class reference or class member is far more elegant than saying `a::b::c.d->e`. IMHO.
Chris Becke