After reading this question on signed/unsigned compares (they come up every couple of days, I'd say), I wondered: why don't we have proper signed/unsigned comparisons, instead of this horrible mess? Take the output of this small program:

#include <stdio.h>

#define C(T1,T2) \
  { signed   T1 a = -1; \
    unsigned T2 b = 1;  \
    printf("(signed %5s)%d < (unsigned %5s)%d = %d\n", \
           #T1, (int)a, #T2, (int)b, (a < b)); }

#define C1(T) printf("%s:%d\n", #T, (int)sizeof(T)); \
  C(T,char); C(T,short); C(T,int); C(T,long);

int main(void)
{
  C1(char); C1(short); C1(int); C1(long);
}

Compiled with my standard compiler (gcc, 64-bit), I get this:

char:1
(signed  char)-1 < (unsigned  char)1 = 1
(signed  char)-1 < (unsigned short)1 = 1
(signed  char)-1 < (unsigned   int)1 = 0
(signed  char)-1 < (unsigned  long)1 = 0
short:2
(signed short)-1 < (unsigned  char)1 = 1
(signed short)-1 < (unsigned short)1 = 1
(signed short)-1 < (unsigned   int)1 = 0
(signed short)-1 < (unsigned  long)1 = 0
int:4
(signed   int)-1 < (unsigned  char)1 = 1
(signed   int)-1 < (unsigned short)1 = 1
(signed   int)-1 < (unsigned   int)1 = 0
(signed   int)-1 < (unsigned  long)1 = 0
long:8
(signed  long)-1 < (unsigned  char)1 = 1
(signed  long)-1 < (unsigned short)1 = 1
(signed  long)-1 < (unsigned   int)1 = 1
(signed  long)-1 < (unsigned  long)1 = 0

If I compile for 32-bit, the result is the same except that:

long:4
(signed  long)-1 < (unsigned   int)1 = 0

The "How?" of all this is easy to find: Just goto section 6.3 of the C99 standard or chapter 4 of C++ and dig up the clauses which describe how the operands are converted to a common type and this can break if the common type reinterprets negative values.

But what about the "Why?". As we can see, '<' gives the mathematically wrong answer in half of all cases, and the outcome depends on the concrete sizes of the types, so it is platform-dependent. Here are some points to consider:

  • The convert-and-compare process is not exactly a prime example of the Rule of Least Surprise.

  • I don't believe there is code out there which relies on the proposition that (short)-1 > (unsigned)1 and was not written by terrorists.

  • This is all terrible when you're writing C++ template code, because you need type-trait magic to knit a correct "<" (a sketch follows below).


After all, correctly comparing signed and unsigned values of different types is easy to implement:

signed X < unsigned Y  ->  (a < (X)0) || ((Z)a < (Z)b),  where Z = X|Y (the wider of the two types, taken unsigned)

The pre-check is cheap, and the compiler can also optimize it away whenever a >= 0 can be proven statically.
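
A minimal sketch of such a compare in C++, using the C++11 <type_traits> (Boost/TR1 offered equivalents at the time); the name safe_less and the trait plumbing are my own, not part of any standard:

#include <type_traits>

// Hypothetical helper implementing the formula above: a negative signed
// value is smaller than any unsigned value; otherwise both operands fit
// into the wider unsigned type Z, where they compare correctly.
template<class S, class U>
bool safe_less(S a, U b)
{
    static_assert(std::is_signed<S>::value && std::is_unsigned<U>::value,
                  "expects a signed left and an unsigned right operand");
    typedef typename std::conditional<
        (sizeof(S) >= sizeof(U)),
        typename std::make_unsigned<S>::type,  // Z = unsigned X if X is wider
        U                                      // Z = Y otherwise
    >::type Z;
    return a < (S)0 || (Z)a < (Z)b;
}

With this, safe_less((short)-1, 1u) is true on every platform, and the pre-check disappears wherever the compiler can see that a is non-negative.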

So here's my question:

Would it break the language or existing code if we added safe signed/unsigned compares to C/C++?

("Would it break the language" means would we need to make massive changes to different parts of the language to accommodate this change)


UPDATE: I ran this on my good old Turbo-C++ 3.0 and got this output:

char:1
(signed  char)-1 < (unsigned  char)1 = 0

Why is (signed char)-1 < (unsigned char)1 == 0 here?

+7  A: 

I do not think it would break the language, but yes, it could break some existing code (and the breakage would probably be hard to detect at the compiler level).

There exists a lot more code written in C and C++ than you and I together can imagine (some of it may be even written by terrorists).

Relying on the "proposition that (short)-1 > (unsigned)1" may be done unintentionally by someone. There exists a lot of C code dealing with complex bit manipulation and similar things, and it is quite possible that some programmer is using the current comparison behaviour in such code. (Other people have already provided nice examples of such code, and the code is even simpler than I would have expected.)

The current solution is to warn on such comparisons instead and leave the decision to the programmer, which I think is in the spirit of how C and C++ work. Also, solving it at the compiler level would incur a performance penalty, and that is something C and C++ programmers are extremely sensitive to. Two tests instead of one might seem like a minor issue to you, but there is probably plenty of C code where it would matter. The previous behaviour could be forced with explicit casts to a common data type, but that again requires the programmer's attention, so it is no better than a simple warning.
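
For illustration, forcing today's behaviour with such a cast might look like this (a sketch with variables of my own choosing):

#include <stdio.h>

int main(void)
{
    int a = -1;
    unsigned b = 1;
    /* The explicit cast documents the intent and keeps the current
       wrap-around compare: a is deliberately reinterpreted as unsigned. */
    if ((unsigned)a < b)
        puts("wrapped compare: a < b");
    else
        puts("wrapped compare: a >= b");  /* taken: (unsigned)-1 is huge */
    return 0;
}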

Suma
There's no reason this couldn't be added to a later spec, with old code continuing to compile as C89 or C99.
Peter Gibson
Sure there is - the language is not supposed to gratuitously break things from version to version. What do you do when you're using a library that expects one behavior with an application that expects the other? (Especially if part of the library code is macros or inline functions in headers?) Sure there will sometimes be minor breakage, but something as fundamental as this should never be touched!
R..
+6  A: 

Yes, it would break the language/existing code. The language, as you have noted, carefully specifies the behavior when signed and unsigned operands are used together. This behavior with comparison operators is essential for some important idioms, like:

if (x-'0' < 10U)
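
(For readers unfamiliar with the idiom, the wrap-around folds a two-sided range check into a single compare; a small illustration of mine:)

#include <stdio.h>

int main(void)
{
    /* For a digit, x-'0' is 0..9 and the unsigned compare passes. For a
       character below '0', x-'0' is negative and converts to a huge
       unsigned value, so the same single compare rejects it -- no
       (x >= '0' && x <= '9') pair needed. */
    printf("%d\n", '7' - '0' < 10U);  /* 1: '7' is a digit */
    printf("%d\n", '!' - '0' < 10U);  /* 0: negative difference wraps */
    return 0;
}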

Not to mention things like (equality comparison):

size_t l = mbrtowc(&wc, s, n, &state);
if (l==-1) ... /* Note that mbrtowc returns (size_t)-1 on failure */

As an aside, specifying "natural" behavior for mixed signed/unsigned comparisons would also incur a significant performance penalty, even in programs that presently use such comparisons safely, i.e. where they already have their "natural" behavior due to constraints on the input that the compiler would have a hard time determining (or might not be able to determine at all). Having written your own code to handle these tests, I'm sure you've already seen what the penalty would look like, and it's not pretty.

R..
Another common example: `if (snprintf(buf, sizeof buf, ...) >= sizeof buf)` - this catches both errors (return value of -1) and overflows with a single comparison, due to the fact that `size_t` is unsigned.
R..
+1: Good examples. I'm a bit blind in that spot because I usually avoid code which assumes an exploitable representation of ints < 0 in unsigned.
Luther Blissett
@Luther: converting a negative signed value to an unsigned type is well-defined, and doesn't depend on the representation.
Mike Seymour
@Michael: The value of (unsigned)(-1) depends on the size of unsigned. Btw, the method given in 6.3.1.3 seems to suggest that (unsigned)-1 has the same bit pattern as -1 in two's-complement arithmetic, doesn't it?
Luther Blissett
@R: I'm not sure if `size_t x ... if (x==-1)` is actually correct. C99 says about `size_t` that it has to be *an unsigned integer type* and should not have *an integer conversion rank greater than signed long*. This seems to suggest that size_t could in fact be, for example, a 16-bit `unsigned short` alongside a 32-bit `int`. In that case `(x==-1)` would fail, wouldn't it?
Luther Blissett
@Luther: I suppose it would, but there has never been, and will never be, a C implementation where `size_t` is smaller than `int`. While the standard doesn't forbid it, it's completely ridiculous.
R..
@Luther: `unsigned` is the same size as `int`, so `(unsigned)(-1)` will be `UINT_MAX`. And yes, the conversion is equivalent to reinterpreting the two's-complement bit pattern. You're right about `size_t`; if it is smaller than `int`, then it would be converted to a positive-valued `int` for the comparison.
Mike Seymour
@R..: As far as I can tell, a freestanding C89 implementation for a microcontroller could make `size_t` an 8-bit `unsigned char` (the only minimum requirement I can find is §2.2.4.1 “32767 bytes in an object (in a hosted environment only)”). In C99, `SIZE_MAX>=65535`, so `size_t` shorter than `unsigned int` would be pathological.
Gilles
@Gilles: it doesn't seem pathological to me to have a 32-bit `int` and a 16-bit `size_t`. I think there have been processors with 32-bit data registers and a 16-bit address bus, where those would be the natural sizes.
Mike Seymour
+8  A: 

My answer is for C only.

There is no type in C that can accommodate all possible values of all possible integer types. The closest C99 comes to this is intmax_t and uintmax_t, and the intersection of their ranges covers only half of each: only [0, INTMAX_MAX] is representable in both.

Therefore, you cannot implement a mathematical value comparison such as x <= y by first converting x and y to a common type and then doing a simple operation. This is a major departure from the general principle of how operators work, and it breaks the intuition that operators correspond to things that tend to be single instructions on common hardware.

Even if you added this complexity to the language (and the extra burden on implementation writers), it wouldn't have very nice properties. For example, x <= y would still not be equivalent to x - y <= 0. If you wanted all these nice properties, you'd have to make arbitrary-sized integers part of the language.
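
A concrete illustration of that broken equivalence (my sketch; assumes 32-bit unsigned, and the proposed mathematical comparison):

#include <stdio.h>

int main(void)
{
    int x = -1;
    unsigned y = 1;
    /* A mathematical <= would call x <= y true (-1 <= 1). But x - y is
       still computed in unsigned arithmetic and wraps to UINT_MAX - 1,
       so (x - y <= 0) remains false -- the equivalence cannot hold. */
    printf("%u\n", x - y);         /* 4294967294 with 32-bit unsigned */
    printf("%d\n", x - y <= 0u);   /* 0 */
    return 0;
}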

I'm sure there's plenty of old Unix code out there, possibly some running on your machine, that assumes that (int)-1 > (unsigned)1. (OK, maybe it was written by freedom fighters ;-)

If you want lisp/haskell/python/$favorite_language_with_bignums_built_in, you know where to find it...

Gilles
+1 for bringing up the still-missing equivalence between `x<=y` and `x-y<=0`.
R..
Hm. I would have to make the type of x-y signed in all cases where either x or y is signed if I want to preserve this. I think this counts as 'breaking the language'.
Luther Blissett
@Luther: you have to make the type of `x-y` big enough to accommodate `UINTMAX_MAX - INTMAX_MIN`, which means it must be one bit bigger than (`u`)`intmax_t`. Therefore, there cannot be a biggest integer type; in other words, the language must have bignums built in. I wouldn't call the resulting language C.
Gilles
Why? `int1+int2` can't hold `INTMAX_MAX+INTMAX_MAX` either. And almost everyone's happy with ints *that one can't even negate without invoking UB*. So we could either decide to let x-y under- or overflow into UB (trap, wrap-around) or specify under/overflow behavior that actually makes sense.
Luther Blissett
@Luther: limited-size types, mathematically-intuitive `x<=y`, `x<=y` equivalent to `x-y<=0`: pick two. Your proposal keeps 1 and adds 2 to the language, but I don't see much point in 2 without 3.
Gilles
I'm not convinced that *"intuitive equivalences should hold"* really makes a point in support of C's way of handling signed/unsigned. For example, `-1<1u` is `0`, but `0<1u+1` is `1` obviously, while `-1-1u<0` is `0` again.
Luther Blissett
As far as the mathematics go, the problem is that naive coders expect `<` to be a mathematical order relation, while the ring they're working in is not the integers but **integers mod 2^n**. There's simply no way to have an order relation that fits the usual algebraic rules (e.g. `a<b => a+c<b+c`) on this ring, so you have to deal with the fact that the `<` relation is actually carving out an interval. You can either whine about this and beg people to "fix" the standard to meet impossible criteria, or you can use it to your advantage to write efficient code. :-)
R..
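
(A concrete instance of the failing rule, as a sketch of mine:)

#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned a = 0, b = 1, c = UINT_MAX;
    /* In the ring mod 2^n, a < b does not imply a + c < b + c:
       a + c wraps to UINT_MAX while b + c wraps to 0. */
    printf("%d\n", a < b);          /* 1 */
    printf("%d\n", a + c < b + c);  /* 0 */
    return 0;
}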
*"As far as the mathematics go, the problem is that naive coders expect < to be a mathematical order relation, while the ring they're working in is not the integers but integers mod 2^n."* -- We're with mod 2^n only as long as we stay within unsigned numbers. int *has* this order relation (barred UB). What I'm asking is, if it is really necessary to have a completely inconsistent order (which depends on the size of the operands) when it comes to **mixed** sign arithmetic.
Luther Blissett
+1  A: 

I think C++ is like the Roman Empire: it's big, and too established to fix the things that are going to destroy it.

C++0x and Boost are examples of horrible, horrible syntax, the kind of baby only its parents could love, and are a long, long way from the simple, elegant (but severely limited) C++ of 10 years ago.

The point is, by the time one has "fixed" something as terribly simple as comparisons of integral types, enough legacy and existing C++ code will have been broken that one might as well just call it a new language.

And once broken, there is so much else that is also eligible for retroactive fixing.

Chris Becke
The beast is not my kid, and yet I like it... What parts of C++0x do you deem *horrible, horrible*? There are some that are awkward... but all languages have that sort of thing (just name a language).
David Rodríguez - dribeas
I refer not to any of the features of C++0x, which are all fantastic and sorely needed. I refer to the fact that C++0x and Boost (and the STL) are implemented on top of a horrid template (and namespace) syntax. As a simple question: why is `::` used as the scope resolution operator? There are virtually no situations where a `.` would cause ambiguity. Or, instead of templates, why didn't C++ just have types as first-class variables (or give them first-class variable usage)? I.e., replace `Array<int> theArray;` with `Array theArray(int);`.
Chris Becke
I agree that templates could be a lot cleaner, but I don't see why a `::` operator is particularly horrible. In general, I don't really care what individual syntactic tokens look like. Surely that's not what makes the language ugly.
jalf
`::` is horrible, as is `->`. If the array decomposition rule had been extended to classes, then all class variables would have decomposed to implicit class references, meaning that `.` could have been used as a universal dereferencing / scope resolution operator. Doing more with less, extended too far, can result in languages that look like Lisp. But generally, being able to say `a.b.c.d.e` without caring whether a ... e is a namespace, class name, class instance, class reference or class member is far more elegant than saying `a::b::c.d->e`. IMHO.
Chris Becke