views: 194 · answers: 5
I am looking for suggestions on how to find the sizes (in bits) and ranges of floating point numbers in an architecture-independent manner. The code could be built on various platforms (AIX, Linux, HPUX, VMS, maybe Windoze) using different flags, so the results should vary. The sign I've only ever seen as one bit, but how can I measure the sizes of the exponent and mantissa?

+3  A: 

Have a look at the values defined in float.h. Those should give you the values you need.
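For example, a minimal sketch that just prints the standard `<float.h>` macros (the names below are all defined by the C standard):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* *_MANT_DIG counts mantissa digits in base FLT_RADIX,
           including the implicit leading bit. */
        printf("radix:            %d\n", FLT_RADIX);
        printf("float mantissa:   %d digits\n", FLT_MANT_DIG);
        printf("double mantissa:  %d digits\n", DBL_MANT_DIG);
        printf("float exponent:   %d..%d\n", FLT_MIN_EXP, FLT_MAX_EXP);
        printf("double exponent:  %d..%d\n", DBL_MIN_EXP, DBL_MAX_EXP);
        printf("float range:      %e..%e\n", FLT_MIN, FLT_MAX);
        printf("double range:     %e..%e\n", DBL_MIN, DBL_MAX);
        return 0;
    }

Since the compiler fills these in for whatever format the target actually uses, this stays architecture-independent with no bit-twiddling.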

mipadi
+1  A: 

The number of bits used to store each field in a floating point number doesn't change.

                     Sign     Exponent      Fraction       Bias
Single Precision     1 [31]    8 [30-23]    23 [22-00]      127
Double Precision     1 [63]   11 [62-52]    52 [51-00]     1023

EDIT: As Jonathan pointed out in the comments, I left out the long double type. I'll leave its bit decomposition as an exercise for the reader. :)
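A hedged sketch of that decomposition, assuming the platform uses IEEE 754 binary64 for double (the `memcpy` avoids the undefined behavior of pointer-casting):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        double d = -1.5;
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);

        uint64_t sign     = bits >> 63;                 /*  1 bit  [63]    */
        uint64_t exponent = (bits >> 52) & 0x7FF;       /* 11 bits [62-52] */
        uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;  /* 52 bits [51-00] */

        /* -1.5 = (-1)^1 * 1.1b * 2^0, so the exponent field holds the
           bias (1023) and the fraction's top bit is set. */
        printf("sign=%llu exponent=%llu fraction=%#llx\n",
               (unsigned long long)sign,
               (unsigned long long)exponent,
               (unsigned long long)fraction);
        return 0;
    }

On a non-IEEE platform (see the comments below) the masks and widths would of course differ.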

Bill the Lizard
Not true, for some values of true ;^) There exist a small number of platforms that do not use IEEE 754. But for the most part you are of course correct.
Don Wakefield
You missed out long double.
Jonathan Leffler
@Don: *Very* small values of true. :)
Bill the Lizard
@Jonathan: Thanks, I edited my response. Long double was a pretty late addition to the standard, but it's worth at least a footnote.
Bill the Lizard
+4  A: 

Since you're looking at building for a number of systems, I think you may be looking at using GCC for compilation.

Some good info on floating point - this is what almost all modern architectures use: http://en.wikipedia.org/wiki/IEEE_754

This details some of the differences that can come up: http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html

Andrew Theken
+3  A: 

As you follow the links suggested in previous comments, you'll probably see references to What Every Computer Scientist Should Know About Floating Point Arithmetic. By all means, take the time to read this paper. It pops up everywhere when floating point is discussed.

Don Wakefield
+1  A: 

It's relatively easy to find out:

Decimal or binary:

    myfloat a = 2.0, b = 0.0;

    for (int i = 0; i < 20; i++)
        b += 0.1;

    /* (a == b) => decimal, else binary */

Reason: all binary systems can represent 2.0 exactly, but any binary system has an error term when representing 0.1. By accumulating you make sure this error term does not vanish through rounding, the way e.g. 1.0 == 3.0*(1.0/3.0) holds even on binary systems.

Mantissa length:

    myfloat a = 1.0, b = 1.0, c, inc = 1.0;
    int mantissabits = 0;

    do {
        mantissabits++;
        inc *= 0.5;      /* effectively shift to the right */
        c = b + inc;
    } while (a != c);

You are adding ever-smaller terms until you exceed the capacity of the mantissa. It yields 24 bits for float and 53 bits for double, which is correct (the mantissa field itself holds only 23/52 bits, but since the leading bit of a normalized value is always one, you get a hidden extra bit).
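As a sanity check, here is the halving loop as a complete C program, compared against `<float.h>` (this assumes double arithmetic is actually performed in double precision, i.e. no x87 extended-precision surprises):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        double a = 1.0, b = 1.0, c, inc = 1.0;
        int mantissabits = 0;

        do {
            mantissabits++;
            inc *= 0.5;      /* effectively shift the added bit right */
            c = b + inc;
        } while (a != c);

        /* On IEEE 754 systems both should print 53. */
        printf("measured: %d, float.h says: %d\n", mantissabits, DBL_MANT_DIG);
        return 0;
    }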

Exponent length:

    myfloat a = 1.0;
    int max = 0, min = 0;

    while (true) {
        a *= 2.0;
        if (!isinf(a) && !isnan(a))   /* in C, use isinf()/isnan() -- a != NaN cannot work */
            max++;
        else
            break;
    }

    a = 1.0;
    while (true) {
        a *= 0.5;
        if (a != 0.0)
            min--;
        else
            break;
    }

You are shifting 1.0 to the left or to the right until you hit the top or the bottom of the range. Normally the normal exponent range is roughly symmetric around zero (IEEE 754 sets the minimum normal exponent to one minus the maximum). If min comes out well below that, e.g. smaller than -(max+1), you have subnormals (as floats and doubles do). Positive and negative values are normally symmetric (with perhaps an offset of one), but you can adapt the test by probing with negative values.
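A runnable version of the two probing loops, assuming an IEEE 754 double and using C99's `isinf()` to detect overflow (a direct comparison against NaN can never succeed, since NaN compares unequal to everything, including itself):

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void)
    {
        double a = 1.0;
        int max = 0, min = 0;

        while (!isinf(a * 2.0)) {   /* shift left until the next doubling overflows */
            a *= 2.0;
            max++;
        }

        a = 1.0;
        while (a * 0.5 != 0.0) {    /* shift right until the next halving underflows */
            a *= 0.5;
            min--;
        }

        /* max lands on DBL_MAX_EXP - 1; min reaching
           DBL_MIN_EXP - DBL_MANT_DIG shows subnormal support. */
        printf("measured exponent range: 2^%d .. 2^%d\n", min, max);
        printf("float.h: DBL_MIN_EXP=%d DBL_MAX_EXP=%d\n",
               DBL_MIN_EXP, DBL_MAX_EXP);
        return 0;
    }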