views:

752

answers:

3

I'm wanting to find the number of mantissa digits and the unit round-off on a particular computer. I have an understanding of what these are, just no idea how to find them - though I understand they can vary from computer to computer.

I need this number in order to perform certain aspects of numerical analysis, like analyzing errors.

What I'm currently thinking is that I could write a small c++ program to slowly increment a number until overflow occurs, but I'm not sure what type of number to use.

Am I on the right track? How exactly does one go about calculating this?

+4  A: 

I would think that whatever language you were using would specify how floats were stored. I know Java does this by use of a specific IEEE standard (754, I think).

If it's not specified, I would think you could just do your own check by adding 0.5 to 1 to see if the actual number changes. If it does, then add 0.25 to 1, the 0.125 to 1, and so on until the number doesn't change, something like:

float a = 1;
float b = 0.5;
int bits = 0;
while (a + b != a) {
    bits = bits + 1;
    b = b / 2;
}

If you only had 3 mantissa bits, then 1 + 1/16 would be equal to 1.

Then you've exhausted your mantissa bits.

You might actually need the base number to be 2 rather than 1, since IEEE754 uses an implied '1+' at the start.

EDIT:

It appears the method described above may have some issues as it gives 63 bits for a system that clearly has 4-byte floats.

Whether that's to do with intermediate results (I doubt it since the same code with explicit casts [while (((float)(a + b) != (float)(a))] has similar problems) or (more likely, I believe) the possibility that the unit value a can be represented with bits closer to the fractional b by adjusting the exponent, I don't yet know.

For now, it's best to rely on the language information I mentioned above such as use of IEEE754 (if that information is available).

I'll leave the problematic code in as a trap for wary players. Maybe someone with more floating point knowledge then I can leave a note explaining why it acts strangely (no conjecture, please :-).

EDIT 2:

This piece of code fixes it by ensuring intermediates are stored in floats. Turns out Jonathan Leffler was right - it was intermediate results.

#include <stdio.h>
#include <float.h>

int main(void) {
    float a = 1;
    float b = 0.5;
    float c = a + b;
    int bits = 1;
    while (c != a) {
        bits = bits + 1;
        b = b / 2;
        c = a + b;
    }
    printf("%d\n",FLT_MANT_DIG);
    printf("%d\n",bits);
    return 0;

}

This code outputs (24,24) to show that the calculated value matches the one in the header file.

Whilst written in C, it should be applicable to any language (specifically one where the information isn't available in a header or by virtue that it's specified in the language documentation). I only tested in C because Eclipse takes so long to start on my Ubuntu box :-).

paxdiablo
Thanks for taking the extra time to solve the initial problem - I understood your logic, and was also confused as to why the initial way wasn't giving the same answer as FLT_MANT_DIG.
Zachary
A: 

For C, and by extension C++, the information is in the <float.h> or <cfloat> headers.

For C99, the information is in section 5.2.4.2.2 of the standard:

  • FLT_RADIX
  • FLT_MANT_DIG
  • FLT_DIG
  • FLT_EPSILON
  • FLT_MIN_EXP
  • FLT_MIN
  • FLT_MIN_10_EXP
  • FLT_MAX_EXP
  • FLT_MAX
  • FLT_MAX_10_EXP

And similarly for DBL and LDBL variations on most of these (no DBL_RADIX or LDBL_RADIX). The standard suggests values appropriate for IEEE 754 (the older version of the IEEE 754 standard that was current in 1999; there was a new version published in, I believe, 2008).

Jonathan Leffler
Using the method that Pax proposed yielded an answer of 63, however FLT_MANT_DIG says that he mantissa has 23 digits. I suppose there is something I'm not understanding, but why the discrepancy?
Zachary
@Zachary, did you use 1 or 2 as the base for adding the reducing fractions?
paxdiablo
I used 1 the first time, which gave 63, and 2 the second time, which gave 62.
Zachary
Interesting, I get the same results. But, if the difference between 1 and 1+2^-63 is visible, there must be 63 binary digits.
paxdiablo
If you use 3, you also get 62, 4 thru 7 gets you 61, 8 gets you 60. In other words each doubling of the non-fractional part gives you one less bit for the fraction.
paxdiablo
There's no way that anything other than an 80-bit intermediate result will give you 62 bits for the mantissa. A C float is normally just 32-bits in total, including the exponent and mantissa. So, the computation is telling you about intermediate results, not about float values.
Jonathan Leffler
I'm unconvinced, @JL. With specific casts, it still finds a difference between 1 and 1+2^-63. I think the 1 may be being scaled down (bitwise, adjusting the exponent) so it's closer to the fractional part, hence able to fit withing the 32 bits. I'll investigate some more.
paxdiablo
Note that if you use 0 as your base the answer comes to 149.
Zachary
@Zachary, see my updated answer, @JL was right regarding intermediates.
paxdiablo
+1  A: 

You might want to check out <limits> in your C++ library:

#include <iostream>
#include <limits>
#include <typeinfo>

template <typename T>
void printDetailsFor() {
    std::cout
     << "Printing details for " << typeid(T).name() << ":\n"
     << "\tradix:        " << std::numeric_limits<T>::radix        << "\n"
     << "\tradix digits: " << std::numeric_limits<T>::digits       << "\n"
     << "\tepsilon:      " << std::numeric_limits<T>::epsilon()    << "\n"
     << std::endl;
}

int main() {
    printDetailsFor<int>();
    printDetailsFor<float>();
    printDetailsFor<double>();
    printDetailsFor<long double>();
    return 0;
}

I think that you want std::numeric_limits<T>::digits which should be one more than the number of mantissa bits. My machine prints out:

Printing details for i:
    radix:        2
    radix digits: 31
    epsilon:      0

Printing details for f:
    radix:        2
    radix digits: 24
    epsilon:      1.19209e-07

Printing details for d:
    radix:        2
    radix digits: 53
    epsilon:      2.22045e-16

Printing details for e:
    radix:        2
    radix digits: 64
    epsilon:      1.0842e-19
D.Shawley