questions about ieee-754 | ansaurus

ieee-754

Why are C/C++ floating point types so oddly named?

C++ offers three floating point types: float, double, and long double. I rarely use floating point in my code, but when I do, I'm always caught out by warnings on innocuous lines like float PiForSquares = 4.0; The problem is that the literal 4.0 is a double, not a float, which is irritating. For integer types, we have short in...

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work? double d = floor(3.0 + 0.5); int x = (int) d; assert(x == 3); My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like 2.99999, and x ends up being 2. For the answer to this question to be yes, all integers ...

Floating Point to Binary Value (C++)

I want to take a floating point number in C++, like 2.25125, and get an int array filled with the binary value that is used to store the float in memory (IEEE 754). So I could take a number and end up with an int num[16] array holding the binary value of the float: num[0] would be 1, num[1] would be 1, num[2] would be 0, num[3] would be 1, and so o...
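As a rough illustration of the bit extraction asked about above, here is a minimal sketch. It is written in Python rather than C++, and the helper name `float32_bits` is hypothetical. It assumes the value should be rounded to IEEE 754 single precision, which occupies 32 bits, so a 16-element array as in the excerpt would hold only half of them.

```python
import struct

def float32_bits(value):
    """Return the 32 bits of value as an IEEE 754 single, most significant bit first."""
    # ">f" packs the value as a big-endian 32-bit float, rounding to single precision;
    # ">I" reads the same four bytes back as an unsigned 32-bit integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    # Walk from bit 31 (the sign bit) down to bit 0.
    return [(raw >> shift) & 1 for shift in range(31, -1, -1)]

bits = float32_bits(2.25)
# Field layout of a single: 1 sign bit, 8 exponent bits, 23 fraction bits.
sign, exponent, fraction = bits[0], bits[1:9], bits[9:]
```

The list is ordered sign bit first, which is one possible convention; the excerpt does not say which order the asker wanted.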

Checking if a double (or float) is nan in C++

Is there an isnan() function? P.S. I'm using MinGW (if that makes a difference). UPDATE: Thanks for the responses. I solved this by using isnan() from <math.h>, which doesn't exist in <cmath>, which I was #include-ing at first. ...

What are the applications/benefits of an 80-bit extended precision data type?

Yeah, I meant to say 80-bit. That's not a typo... My experience with floating point variables has always involved 4-byte multiples, like singles (32 bit), doubles (64 bit), and long doubles (which I've seen referred to as either 96-bit or 128-bit). That's why I was a bit confused when I came across an 80-bit extended precision data type ...

What languages expose IEEE 754 traps to the developer?

I'd like to play with those traps for educational purposes. A common problem with the default behavior in numerical computing is that we "miss" the NaN (or +-inf) that appeared in a faulty operation. The default behavior is propagation through the computation, but some operations (like comparisons) break the chain and lose the NaN, and the res...

floating-point-exceptions

What languages get IEEE 754 right?

Hi all, I just spent my week working on the subject and found no language that gets the IEEE 754 spec right. Even GCC doesn't respect the relevant part of C99 (it ignores the FENV_ACCESS pragma, and I've been told that my working examples were sheer luck). It is impossible (AFAIK) to respect the spec with library functions, you need s...

Ensuring C++ doubles are 64 bits

In my C++ program, I need to pull a 64 bit float from an external byte sequence. Is there some way to ensure, at compile-time, that doubles are 64 bits? Is there some other type I should use to store the data instead? Edit: If you're reading this and actually looking for a way to ensure storage in the IEEE 754 format, have a look at Ada...

Representing integers in doubles

Can a double (of a given number of bytes, with a reasonable mantissa/exponent balance) always fully precisely hold the range of an unsigned integer of half that number of bytes? E.g. can an eight byte double fully precisely hold the range of numbers of a four byte unsigned int? What this will boil down to is if a two byte float can hol...
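One way to probe the question above is to round-trip integers through a double and see where exactness stops. The sketch below uses Python, whose float is an IEEE 754 binary64 on typical platforms (an assumption stated here, not something from the excerpt). The helper name `is_exact` is hypothetical.

```python
import sys

# Number of significand bits in a Python float; 53 for IEEE 754 binary64.
MANT_DIG = sys.float_info.mant_dig

def is_exact(n):
    """True if the integer n converts to float and back without changing."""
    return int(float(n)) == n

largest_u32 = 2**32 - 1   # top of a four-byte unsigned range
limit = 2**MANT_DIG       # every integer up to here has its own double

fits_u32 = is_exact(largest_u32)
fits_limit = is_exact(limit)
fits_beyond = is_exact(limit + 1)   # first integer that has no exact double
```

With a 53-bit significand, the whole 32-bit unsigned range fits with room to spare. The same reasoning applied to a 32-bit float, which has a 24-bit significand, covers a 16-bit unsigned range.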

How can I convert four characters into a 32-bit IEEE-754 float in Perl?

I have a project where a function receives four 8-bit characters and needs to convert the resulting 32-bit IEEE-754 float to a regular Perl number. Seems like there should be a faster way than the working code below, but I have not been able to figure out a simpler pack function that works. does not work - seems like it is close $floa...

Precision of Floating Point

So, I know a little bit about how floating point numbers are represented, but not enough to be sure of my answer. The general question: for a given precision (for my purposes, the number of accurate decimal places in base 10), what range of numbers can be represented for 16-, 32-, and 64-bit IEEE-754 numbers? Specifically, I'm only interested ...

The Double Byte Size in 32 bit and 64 bit OS

Is there a difference in double size when I run my app in a 32-bit versus a 64-bit environment? If I am not mistaken, the double in a 32-bit environment will take up 16 digits after 0, whereas the double in 64-bit will take up 32 bit, am I right? ...

Accurate evaluation of the series 1/1 + 1/2 + ... + 1/n

I need to evaluate the sum of the series 1/1+1/2+1/3+...+1/n. Since floating point evaluation in C++ is not completely accurate, the order of summation plays an important role. The expression 1/n+1/(n-1)+...+1/2+1/1 gives the more accurate result. So I need to find the order of summation that provides the maximum accuracy. I don't even know wh...
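To compare the two summation orders mentioned above, a small experiment along these lines may help. It is a sketch in Python rather than C++, and the function names are hypothetical. The reference value uses `math.fsum`, which returns the correctly rounded sum of the already-rounded terms 1.0/k, so it measures summation error only.

```python
import math

def harmonic_forward(n):
    """Sum 1/1 + 1/2 + ... + 1/n, largest terms first."""
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / k
    return total

def harmonic_backward(n):
    """Sum 1/n + ... + 1/2 + 1/1, smallest terms first."""
    total = 0.0
    for k in range(n, 0, -1):
        total += 1.0 / k
    return total

def harmonic_reference(n):
    """Correctly rounded sum of the float terms, via math.fsum."""
    return math.fsum(1.0 / k for k in range(1, n + 1))

n = 100_000
error_forward = abs(harmonic_forward(n) - harmonic_reference(n))
error_backward = abs(harmonic_backward(n) - harmonic_reference(n))
```

Adding the small terms first tends to lose less, because they accumulate before meeting a large partial sum, but this is a tendency rather than a guarantee for every n. Compensated (Kahan) summation is another option that the excerpt does not mention.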

Floating point addition: loss-of-precision issues

In short: how can I execute a+b such that any loss-of-precision due to truncation is away from zero rather than toward zero? The Long Story: I'm computing the sum of a long series of floating point values for the purpose of computing the sample mean and variance of the set. Since Var(X) = E(X^2) - E(X)^2, it suffices to maintain running ...

When I'm multiplying a float using multu, should I ignore the result in the LO register?

In our project, we take two floats from the user, store them in integer registers, and treat them as IEEE 754 single precision floats, manipulating the bits by masking. So after I multiply the 23 bits of fraction value, should I take into account the result placed in the LO register if I want to return a single precision float (32 bits...

Large numbers erroneously rounded in JavaScript

See this code: <html> <head> <script src="http://www.json.org/json2.js" type="text/javascript"></script> <script type="text/javascript"> var jsonString = '{"id":714341252076979033,"type":"FUZZY"}'; var jsonParsed = JSON.parse(jsonString); console.log(jsonString, jsonParsed); </script> </head> <body> </body> </html> Wh...
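The rounding in the snippet above can be reproduced outside the browser. The sketch below uses Python, whose `json` module keeps integers exact by default. Passing `parse_int=float` is used here only to mimic a parser that stores every number as a binary64 double, as JavaScript numbers are. `math.ulp` requires Python 3.9 or later.

```python
import json
import math

json_string = '{"id":714341252076979033,"type":"FUZZY"}'

# Default parsing: the id stays an arbitrary-precision integer.
exact = json.loads(json_string)

# Mimic a double-only parser by converting every integer literal to float.
as_double = json.loads(json_string, parse_int=float)

# Largest integer below which every integer has its own double.
max_safe_integer = 2**53 - 1

# Distance between adjacent doubles at the id's magnitude.
gap = math.ulp(as_double["id"])

id_changed = int(as_double["id"]) != exact["id"]
```

Because the id is larger than 2**53 - 1, neighbouring doubles at that magnitude are much more than 1 apart, and the parsed value snaps to the nearest one. A common workaround, which is an assumption about the fix and not part of the excerpt, is to transmit such ids as JSON strings.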

floating-accuracy

Formatting doubles for output in C#

Running a quick experiment related to Is double Multiplication Broken in .NET? and reading a couple of articles on C# string formatting, I thought that this: { double i = 10 * 0.69; Console.WriteLine(i); Console.WriteLine(String.Format(" {0:F20}", i)); Console.WriteLine(String.Format("+ {0:F20}", 6.9 - i)); Console....

Arithmetic in Ruby

Why does the code 7.30 - 7.20 in Ruby return 0.0999999999999996 rather than 0.10? But if I write 7.30 - 7.16, for example, everything is fine and I get 0.14. What is the problem, and how can I solve it? ...
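The effect described above is not specific to Ruby. It comes from the binary representation of decimal fractions. Here is a minimal sketch in Python, which uses the same IEEE 754 doubles. The variable names are illustrative only.

```python
import math
from decimal import Decimal

# Neither 7.30 nor 7.20 has an exact binary representation,
# so the difference is not exactly 0.10 either.
diff = 7.30 - 7.20

# Decimal(float) exposes the exact value each literal was rounded to.
exact_730 = Decimal(7.30)
exact_720 = Decimal(7.20)

# Option 1: compare with a tolerance instead of ==.
close_enough = math.isclose(diff, 0.10, rel_tol=1e-9)

# Option 2: do the arithmetic in base 10, starting from strings.
decimal_diff = Decimal("7.30") - Decimal("7.20")
```

Ruby has a comparable decimal type in `BigDecimal`. Whether a tolerance or decimal arithmetic is the better fit depends on the application, for example money versus measurements.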

What is the rationale for all comparisons returning false for IEEE754 NaN values?

Why do comparisons of NaN values behave differently from all other values? That is, all comparisons with the operators ==, <=, >=, <, > where one or both values are NaN return false, contrary to the behaviour of all other values. I suppose this simplifies numerical computations in some way, but I couldn't find an explicitly stated reaso...
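The behaviour described above can be checked directly. This is a small sketch in Python, whose floats follow the IEEE 754 comparison rules. The helper `is_nan` is hypothetical and shows the self-inequality idiom that these rules make possible.

```python
import math

nan = float("nan")

# The five operators listed in the question all evaluate to False with NaN.
comparisons = [nan == nan, nan <= nan, nan >= nan, nan < 1.0, nan > 1.0]

# != is the exception: it evaluates to True whenever an operand is NaN.
not_equal = nan != nan

def is_nan(x):
    """Detect NaN without a library call: only NaN is unequal to itself."""
    return x != x

# NaN also arises from invalid operations such as inf - inf.
produced = float("inf") - float("inf")
```

In practice `math.isnan` is the clearer way to test for NaN. The self-comparison is shown only because it follows from the rule in the question.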

Convert a string with a hex representation of an IEEE-754 double into JavaScript numeric variable

Suppose I have a hex number "4072508200000000" and I want the floating point number that it represents (293.03173828125000) in IEEE-754 double format to be put into a JavaScript variable. I can think of a way that uses some masking and a call to pow(), but is there a simpler solution? A client-side solution is needed. This may help. I...
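For comparison with the masking approach mentioned above, reinterpreting the eight bytes directly avoids the arithmetic altogether. The sketch below is in Python rather than JavaScript, and the function names are hypothetical. It assumes the hex string is in big-endian order, sign bit first, as in the excerpt's example.

```python
import struct

def hex_to_double(hex_string):
    """Interpret 16 hex digits as a big-endian IEEE 754 double."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 8:
        raise ValueError("expected exactly 16 hex digits")
    (value,) = struct.unpack(">d", raw)
    return value

def double_to_hex(value):
    """Inverse of hex_to_double: the 16 hex digits of a double."""
    return struct.pack(">d", value).hex()

value = hex_to_double("4072508200000000")   # the example from the excerpt
```

In a browser, the analogous idea would be to write the bytes into a buffer and read them back as a float, but that is outside the scope of this sketch.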
