views: 194 · answers: 5
I am looking for suggestions on how to find the sizes (in bits) and ranges of floating point numbers in an architecture-independent manner. The code could be built on various platforms (AIX, Linux, HPUX, VMS, maybe Windoze) using different flags, so the results should vary. The sign I've only ever seen as one bit, but how can I measure the sizes of the exponent and mantissa?

+3  A: 

Have a look at the values defined in float.h. Those should give you the values you need.
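For example, a minimal sketch that just prints the standard `<float.h>` macros (the names below are all defined by the C standard):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* *_MANT_DIG counts mantissa digits in base FLT_RADIX,
           including the implicit leading bit. */
        printf("radix:            %d\n", FLT_RADIX);
        printf("float mantissa:   %d digits\n", FLT_MANT_DIG);
        printf("double mantissa:  %d digits\n", DBL_MANT_DIG);
        printf("float exponent:   %d..%d\n", FLT_MIN_EXP, FLT_MAX_EXP);
        printf("double exponent:  %d..%d\n", DBL_MIN_EXP, DBL_MAX_EXP);
        printf("float range:      %e..%e\n", FLT_MIN, FLT_MAX);
        printf("double range:     %e..%e\n", DBL_MIN, DBL_MAX);
        return 0;
    }

Since the compiler fills these in for whatever format the target actually uses, this stays architecture-independent with no bit-twiddling.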

mipadi
+1  A: 

The number of bits used to store each field in a floating point number doesn't change.

                     Sign     Exponent      Fraction       Bias
Single Precision     1 [31]    8 [30-23]    23 [22-00]      127
Double Precision     1 [63]   11 [62-52]    52 [51-00]     1023

EDIT: As Jonathan pointed out in the comments, I left out the long double type. I'll leave its bit decomposition as an exercise for the reader. :)
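A hedged sketch of that decomposition, assuming the platform uses IEEE 754 binary64 for double (the `memcpy` avoids the undefined behavior of pointer-casting):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        double d = -1.5;
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);

        uint64_t sign     = bits >> 63;                 /*  1 bit  [63]    */
        uint64_t exponent = (bits >> 52) & 0x7FF;       /* 11 bits [62-52] */
        uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;  /* 52 bits [51-00] */

        /* -1.5 = (-1)^1 * 1.1b * 2^0, so the exponent field holds the
           bias (1023) and the fraction's top bit is set. */
        printf("sign=%llu exponent=%llu fraction=%#llx\n",
               (unsigned long long)sign,
               (unsigned long long)exponent,
               (unsigned long long)fraction);
        return 0;
    }

On a non-IEEE platform (see the comments below) the masks and widths would of course differ.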

Bill the Lizard
Not true, for some values of true ;^) There exist a small number of platforms that do not use IEEE 754. But for the most part you are of course correct.
Don Wakefield
You missed out long double.
Jonathan Leffler
@Don: *Very* small values of true. :)
Bill the Lizard
@Jonathan: Thanks, I edited my response. Long double was a pretty late addition to the standard, but it's worth at least a footnote.
Bill the Lizard
+4  A: 

Since you're looking at building for a number of systems, I think you may be looking at using GCC for compilation.

Some good info on floating point - this is what almost all modern architectures use: http://en.wikipedia.org/wiki/IEEE_754

This details some of the differences that can come up: http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html

Andrew Theken
+3  A: 

As you follow the links suggested in previous comments, you'll probably see references to What Every Computer Scientist Should Know About Floating Point Arithmetic. By all means, take the time to read this paper. It pops up everywhere when floating point is discussed.

Don Wakefield
+1  A: 

It's relatively easy to find out:

Decimal or binary:

    myfloat a = 2.0, b = 0.0;

    for (int i = 0; i < 20; i++)
        b += 0.1;

    /* (a == b) => decimal, else binary */

Reason: all binary systems can represent 2.0 exactly, but any binary system has an error term when representing 0.1. By accumulating you make sure this error term does not vanish through rounding, the way e.g. 1.0 == 3.0*(1.0/3.0) holds even on binary systems.

Mantissa length:

    myfloat a = 1.0, b = 1.0, c, inc = 1.0;
    int mantissabits = 0;

    do {
        mantissabits++;
        inc *= 0.5;      /* effectively shift to the right */
        c = b + inc;
    } while (a != c);

You are adding ever-smaller terms until you exceed the capacity of the mantissa. It yields 24 bits for float and 53 bits for double, which is correct (the mantissa field itself holds only 23/52 bits, but since the leading bit of a normalized value is always one, you get a hidden extra bit).
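As a sanity check, here is the halving loop as a complete C program, compared against `<float.h>` (this assumes double arithmetic is actually performed in double precision, i.e. no x87 extended-precision surprises):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        double a = 1.0, b = 1.0, c, inc = 1.0;
        int mantissabits = 0;

        do {
            mantissabits++;
            inc *= 0.5;      /* effectively shift the added bit right */
            c = b + inc;
        } while (a != c);

        /* On IEEE 754 systems both should print 53. */
        printf("measured: %d, float.h says: %d\n", mantissabits, DBL_MANT_DIG);
        return 0;
    }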

Exponent length:

    myfloat a = 1.0;
    int max = 0, min = 0;

    while (true) {
        a *= 2.0;
        if (!isinf(a) && !isnan(a))   /* in C, use isinf()/isnan() -- a != NaN cannot work */
            max++;
        else
            break;
    }

    a = 1.0;
    while (true) {
        a *= 0.5;
        if (a != 0.0)
            min--;
        else
            break;
    }

You are shifting 1.0 to the left or to the right until you hit the top or the bottom of the range. Normally the normal exponent range is roughly symmetric around zero (IEEE 754 sets the minimum normal exponent to one minus the maximum). If min comes out well below that, e.g. smaller than -(max+1), you have subnormals (as floats and doubles do). Positive and negative values are normally symmetric (with perhaps an offset of one), but you can adapt the test by probing with negative values.
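A runnable version of the two probing loops, assuming an IEEE 754 double and using C99's `isinf()` to detect overflow (a direct comparison against NaN can never succeed, since NaN compares unequal to everything, including itself):

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void)
    {
        double a = 1.0;
        int max = 0, min = 0;

        while (!isinf(a * 2.0)) {   /* shift left until the next doubling overflows */
            a *= 2.0;
            max++;
        }

        a = 1.0;
        while (a * 0.5 != 0.0) {    /* shift right until the next halving underflows */
            a *= 0.5;
            min--;
        }

        /* max lands on DBL_MAX_EXP - 1; min reaching
           DBL_MIN_EXP - DBL_MANT_DIG shows subnormal support. */
        printf("measured exponent range: 2^%d .. 2^%d\n", min, max);
        printf("float.h: DBL_MIN_EXP=%d DBL_MAX_EXP=%d\n",
               DBL_MIN_EXP, DBL_MAX_EXP);
        return 0;
    }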