views: 836 · answers: 6

Why is the result of this explicit cast different from the implicit one?

#include <stdio.h>

double  a;
double  b;
double  c;

long    d;

double  e;

int main() {
    a = 1.0;
    b = 2.0;
    c = .1;

    d = (b - a + c) / c;
    printf("%li\n", d);        //    10

    e = (b - a + c) / c;
    d = (long) e;
    printf("%li\n", d);        //    11

    return 0;
}

If I do d = (long) ((b - a + c) / c); I also get 10. Why does the assignment to a double make a difference?

+12  A: 

I suspect the difference is a conversion from an 80-bit floating point value to a long vs a conversion from an 80-bit floating point value to a 64-bit one and then a conversion to a long.

(The reason for 80 bits coming up at all is that that's a typical precision used for actual arithmetic, and the width of floating point registers.)

Suppose the 80-bit result is something like 10.999999999999999 - the conversion from that to a long yields 10. However, the nearest 64-bit floating point value to the 80-bit value is actually 11.0, so the two-stage conversion ends up yielding 11.

EDIT: To give this a bit more weight...

Here's a Java program which uses arbitrary-precision arithmetic to do the same calculation. Note that it converts the double value closest to 0.1 into a BigDecimal - that value is 0.1000000000000000055511151231257827021181583404541015625. (In other words, the exact result of the calculation is not 11 anyway.)

import java.math.*;

public class Test
{
    public static void main(String[] args)
    {
        BigDecimal c = new BigDecimal(0.1d);        
        BigDecimal a = new BigDecimal(1d);
        BigDecimal b = new BigDecimal(2d);

        BigDecimal result = b.subtract(a)
                             .add(c)
                             .divide(c, 40, RoundingMode.FLOOR);
        System.out.println(result);
    }
}

Here's the result:

10.9999999999999994448884876874217606030632

In other words, that's correct to about 40 decimal digits (way more than either 64 or 80 bit floating point can handle).

Now, let's consider what this number looks like in binary. I don't have any tools to easily do the conversion, but again we can use Java to help. Assuming a normalised number, the integer part (1010) occupies four significand bits. That leaves 60 bits of fraction below the binary point in extended precision (a 64-bit significand) and 49 in double precision (a 53-bit significand, counting the implicit leading bit).

So, what's the largest representable number below 11 in each precision? Again, let's use Java:

import java.math.*;

public class Test
{
    public static void main(String[] args)
    {
        BigDecimal half = new BigDecimal("0.5");
        BigDecimal eleven = new BigDecimal(11);

        System.out.println(eleven.subtract(half.pow(60)));
        System.out.println(eleven.subtract(half.pow(49)));
    }
}

Results:

10.999999999999999999132638262011596452794037759304046630859375
10.9999999999999982236431605997495353221893310546875

So, the three numbers we've got are:

Correct value: 10.999999999999999444888487687421760603063...
11-2^(-60): 10.999999999999999999132638262011596452794037759304046630859375
11-2^(-49): 10.9999999999999982236431605997495353221893310546875

Now work out the closest representable value to the correct one in each precision: in extended precision it is still below 11, while in double precision the nearest representable value is 11.0 itself (the correct value is within half a double ulp of 11). Convert each of those to a long — the conversion truncates toward zero — and you end up with 10 and 11 respectively.

Hopefully this is enough evidence to convince the doubters ;)

Jon Skeet
It's an educated guess having seen similar effects in C#. It will be processor and compiler-dependent btw. Am I 100% sure this is what's happening? No. Do I think it's a very likely explanation? Absolutely. More useful than "works on my machine" IMO.
Jon Skeet
http://babbage.cs.qc.edu/IEEE-754/ is very helpful for this sort of thing, although it only has 32- and 64-bit calculators, not an 80-bit calculator.
Adam Rosenfield
@Adam: Thanks very much for the link. Useful indeed. It would be useful if the final "decimal" value was the *exact* value represented by the closest double though.
Jon Skeet
x87 isn't dead yet, but even so, not all floating-point registers are 80 bits wide. Even on x87, calculations aren't necessarily done at 80-bit precision. I expect that some compilers configure the FPU to use a mantissa of the same size as in the 64-bit format. Do you have any information about that?
Bastien Léonard
Does anybody know what the standards say about this? It seems that this is a pretty good gotcha that's not expected even when you're watching out for the usual rounding/truncation problems. It also seems to be compiler/processor dependent. My reading of the doc JP references leads me to believe that the 80-bit stuff should be more hidden.
Dennis Williamson
@Dennis: It's the normal caveat of not trusting too much to the precise results of computations. The fact that you're then rounding the results just makes the difference big :) (No idea what the standards say in C, I'm afraid. .NET has a certain amount of leeway.)
Jon Skeet
+1 awesome answer + code to prove it.
Mehrdad Afshari
@Mehrdad: Of course this isn't *quite* proof that it's what's actually happening. It's proof that it's a reasonable explanation though :)
Jon Skeet
A: 

Straight copy/paste and compile on Linux gives me 11 for both. Adding d = (long) ((b - a + c) / c); also gives 11. Same goes on OpenBSD.

dwc
OS is unlikely to matter. Compiler + options + processor are far more relevant.
Jon Skeet
+1  A: 

codepad.org (gcc 4.1.2) reverses the results of your example, while on my local system (gcc 4.3.2) I get 11 in both cases. This suggests to me that it is a floating point issue. Alternatively, it could theoretically be truncating (b - a + c) which, in an integer context would evaluate to (2 - 1 + 0) / .1, which would be 10, whereas in a float context (2.0 - 1.0 + 0.1) / .1 = 1.1 / .1 = 11. That would be weird though.

Volte
The value of c isn't 0.1 to start with. It's just the closest double to 0.1.
Jon Skeet
+1  A: 

Is it possible that you accidentally cast the numerator to a long instead of the entire operation?

I believe this would result in 10:

e = ((long)(b-a+c))/c;
d = (long)e;

??

Joseph
No, I tested that. It gives me 9.
Dennis Williamson
A: 

Here is a bunch of detail on floating point issues and a really good article. But basically, not all floating point values can be represented by a certain number of bits (32-bits or 64-bits or whatever). This is a deep subject, but one I like because it reminds me of Prof. Kahan. :)

JP Alioto
+1  A: 

I get 10 & 11 on my 32-bit x86 linux system running gcc 4.3.2, too.

The relevant C/asm is here:

26:foo.c         ****     d = (b - a + c) / c;                                               
  42                            .loc 1 26 0
  43 0031 DD050000              fldl    b
  43      0000
  44 0037 DD050000              fldl    a
  44      0000
  45 003d DEE9                  fsubrp  %st, %st(1)
  46 003f DD050000              fldl    c
  46      0000
  47 0045 DEC1                  faddp   %st, %st(1)
  48 0047 DD050000              fldl    c
  48      0000
  49 004d DEF9                  fdivrp  %st, %st(1)
  50 004f D97DFA                fnstcw  -6(%ebp)
  51 0052 0FB745FA              movzwl  -6(%ebp), %eax
  52 0056 B40C                  movb    $12, %ah
  53 0058 668945F8              movw    %ax, -8(%ebp)
  54 005c D96DF8                fldcw   -8(%ebp)
  55 005f DB5DF4                fistpl  -12(%ebp)
  56 0062 D96DFA                fldcw   -6(%ebp)
  57 0065 8B45F4                movl    -12(%ebp), %eax
  58 0068 A3000000              movl    %eax, d
  58      00
  27:foo.c         ****
  28:foo.c         ****     printf("%li\n", d);                                                
  59                            .loc 1 28 0
  60 006d A1000000              movl    d, %eax
  60      00
  61 0072 89442404              movl    %eax, 4(%esp)
  62 0076 C7042400              movl    $.LC3, (%esp)
  62      000000
  63 007d E8FCFFFF              call    printf
  63      FF
  29:foo.c         ****     //    10                                                           
  30:foo.c         ****
  31:foo.c         ****     e = (b - a + c) / c;                                               
  64                            .loc 1 31 0
  65 0082 DD050000              fldl    b
  65      0000
  66 0088 DD050000              fldl    a
  66      0000
  67 008e DEE9                  fsubrp  %st, %st(1)
  68 0090 DD050000              fldl    c
  68      0000
  69 0096 DEC1                  faddp   %st, %st(1)
  70 0098 DD050000              fldl    c
  70      0000
  71 009e DEF9                  fdivrp  %st, %st(1)
  72 00a0 DD1D0000              fstpl   e
  72      0000
  32:foo.c         ****
  33:foo.c         ****     d = (long) e;                                                      
  73                            .loc 1 33 0
  74 00a6 DD050000              fldl    e
  74      0000
  75 00ac D97DFA                fnstcw  -6(%ebp)
  76 00af 0FB745FA              movzwl  -6(%ebp), %eax
  77 00b3 B40C                  movb    $12, %ah
  78 00b5 668945F8              movw    %ax, -8(%ebp)
  79 00b9 D96DF8                fldcw   -8(%ebp)
  80 00bc DB5DF4                fistpl  -12(%ebp)
  81 00bf D96DFA                fldcw   -6(%ebp)
  82 00c2 8B45F4                movl    -12(%ebp), %eax
  83 00c5 A3000000              movl    %eax, d
  83      00

The answer is left as an exercise for the interested reader. (Hint: in the first case the quotient is truncated straight from the 80-bit x87 register — the fistpl right after the fdivrp — while in the second case it is first stored to the 64-bit double e with fstpl, which rounds it up to 11.0, and only then truncated.)