I wrote the following sample in a C++ program using Visual C++:

float f1 = 42.48f;
double d1 = 42.48;
double d2 = f1;

I compiled the program with Visual Studio 2005. In the debugger I see the following values:

f1  42.480000   float
d1  42.479999999999997  double
d2  42.479999542236328  double

As far as I know, d1 is OK, but d2 is wrong.

The problem occurs with /fp:precise, /fp:strict, and /fp:fast alike.

What's the problem here? Any hints on how to avoid it? This leads to serious numerical problems.

+3  A: 

This isn't an issue with VC++ or anything like that - it's a fundamental issue with how floating point numbers are stored on the computer. For more information, see IEEE-754.

The issue is that converting a float to a double preserves the float's value exactly, so converting back from double to float gives exactly the same float value that you started with. The conversion cannot add precision that the float never had. I'm not aware of any way around the loss of precision, except to use only doubles when you need the longer precision. It may be that trying to round the converted float to two decimal places will set it to the correct value, but I'm not sure of that.
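
Both points can be checked with a short sketch. It assumes IEEE-754 float and double arithmetic and a C++11 compiler for std::round (so something newer than Visual Studio 2005). The helper names round_trips and round_to_hundredths are made up for illustration.

```cpp
#include <cmath>

// Widening a float to double is exact, so narrowing back recovers the
// same float (for any value that is not NaN).
bool round_trips(float f) {
    double widened = f;                        // every float is also a double
    return static_cast<float>(widened) == f;
}

// Hypothetical helper: round to two decimal places and let the result
// settle on the nearest double.
double round_to_hundredths(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For the value in the question, round_trips(42.48f) holds, and round_to_hundredths applied to d2 compares equal to the literal 42.48. Rounding like this only helps when you already know how many decimal places the value was supposed to have.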

Daniel G
+2  A: 

There is nothing wrong with what is happening here.

Because of the way floating point numbers are represented in memory, 42.479999999999997 is the closest representation of 42.48 that a double can have.

Read this paper: http://docs.sun.com/source/806-3568/ncg_goldberg.html

It explains what is happening here. Unfortunately, there is nothing you can do about how the value is stored.
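
If you want to see the stored values outside the debugger, printing with 17 significant digits is enough to tell any two distinct doubles apart. This is a minimal sketch using standard iostream formatting; the function name to_string_17 is made up for illustration.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format a double with 17 significant digits, which is enough to
// distinguish any two different double values.
std::string to_string_17(double x) {
    std::ostringstream out;
    out << std::setprecision(17) << x;
    return out.str();
}
```

With a standard library that rounds its output correctly, to_string_17(42.48) gives 42.479999999999997 and to_string_17(42.48f) gives 42.479999542236328, the same digits the debugger shows for d1 and d2.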

Salgar
+1 for the link to "What Every Computer Scientist Should Know About Floating-Point Arithmetic"
Paul R
+2  A: 

The value in f1 and the value in d2 both represent the exact same number. That number is not exactly 42.480000, neither is it exactly 42.479999542236328, although it does have a decimal representation which terminates. When displaying floats, your debug view is sensibly rounding at the precision of a float, and when displaying doubles it's rounding at the precision of a double. So you see about twice as many significant figures of the mystery value when you convert and display as a double.

d1 contains a better approximation to 42.48 than the mystery value, since d1 contains the closest double to 42.48, whereas f1 and d2 only contain the closest float value to 42.48. What did you expect d2 to contain? f1 can't "remember" that it's "really supposed to be" 42.48, so that when it converts to double it gets "more accurate".
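
To put numbers on "closest float", here is a rough sketch that measures the spacing between adjacent floats and the error introduced by going through a float. It assumes IEEE-754 types and C++11 for std::nextafter; float_ulp and float_rounding_error are made-up names.

```cpp
#include <cmath>
#include <limits>

// Gap between f and the next float above it (one unit in the last place).
// Assumes f is finite and not the largest float.
double float_ulp(float f) {
    float next = std::nextafter(f, std::numeric_limits<float>::infinity());
    return static_cast<double>(next) - static_cast<double>(f);
}

// Distance between a double and the float it rounds to.
double float_rounding_error(double x) {
    return std::fabs(x - static_cast<double>(static_cast<float>(x)));
}
```

Near 42.48 adjacent floats are 2^-18 (about 3.8e-6) apart, and d1 and d2 differ by about 4.6e-7. That is less than half the spacing, which is what you expect when f1 is the float nearest to 42.48.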

The way to avoid it depends on which serious numerical problems you mean. If the problem is that d1 and d2 don't compare equal, and you think they should, then the answer is to include a small tolerance in your comparisons, for example, replace d1 == d2 with:

fabs(d1 - d2) <= (fabs(d2) * FLT_EPSILON)

That is just an example, though; I haven't checked whether it deals with this case. You have to pick a tolerance that works for you, and you might also have to worry that d2 might be zero.
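
One way such a comparison might look as a reusable function is sketched below. The name nearly_equal and its default tolerances are assumptions for illustration, not a standard API. It scales the tolerance by the larger magnitude, so negative values work, and takes an optional absolute tolerance for values at or near zero.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most rel_tol times
// the larger magnitude, or by at most abs_tol (useful near zero).
bool nearly_equal(double a, double b,
                  double rel_tol = FLT_EPSILON,
                  double abs_tol = 0.0) {
    double diff = std::fabs(a - b);
    double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}
```

For the values in the question, d1 and d2 differ by about 4.6e-7 while the tolerance is about 5.1e-6, so nearly_equal(d1, d2) is true. FLT_EPSILON is only a sensible default when one side has passed through a float; pick tolerances to suit your own data.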

If the problem is that d2 is not a sufficiently accurate value for your algorithm to produce accurate results, then you have to avoid float values, and/or use a more numerically stable algorithm.

Steve Jessop