views:

123

answers:

3

I'm wondering: if a number is represented one way in a floating point representation, is it going to be represented in the same way in a representation that has a larger size? That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double, and still the same when that is cast to a long double?

I'm wondering because I'm writing a BigInteger implementation, and any floating point number that is passed in I send to a function that accepts a long double to convert it. Which leads me to my next question: obviously floating point numbers do not always have exact representations, so in my BigInteger class, what should I be attempting to represent when given a float? Is it reasonable to try to represent the same number as given by `std::cout << std::fixed << someFloat;`, even if that is not the same as the number passed in? Is that the most accurate representation I will be able to get? If so, ...

What's the best way to extract that value (in base some power of 10)? At the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help feeling there's a better way; certainly taking the remainder when dividing by my base is not accurate with floats.

Finally, I wonder if there is a floating point equivalent of `uintmax_t`, that is, a typename that will always be the largest floating point type on a system. Or is there no point, because `long double` will always be the largest (even if it's the same as `double`)?

Thanks, T.

+9  A: 

If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double precision has more bits of both exponent and mantissa, and also a different exponent bias. But every single-precision value is exactly representable in double precision, including denormalised values (float denormals become ordinary normal values in double's wider exponent range).

I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.

Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024, initialise it all to zero, and then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!

Oli Charlesworth
Thank you for your answer. I'm not sure I understand what you mean when you say "a [...] BigInteger format should be able to represent it exactly". Certainly a BigInteger should be able to represent it, but how do I get the value in the first place? There are numbers (with no fractional part) which the compiler accepts as a valid float, but when I print them I get a different number. Is this a problem with `cout`, then, and the number is still represented exactly? Sorry, that's a little incoherent; I'm just a bit confused about this. Also if it would help if I post the outline of my ...
tjm
... implementation, I'm happy to do that. It's a bit long, though (and very rough at the moment).
tjm
I assume you mean something like `float f = 123456789123456789.0f;` This is a limitation of floating-point, not of "BigIntegers". BigIntegers ought to be able to represent all possible (integral) values of floats, but not vice versa.
Oli Charlesworth
Yup, that's what I mean. So, when presented with a number like that, received as a floating point, the loss of data has already occurred, I think. Now, is the most accurate representation I can get for that number the one shown by `cout`, and if it is, is it reasonable to try to represent that number in my BigInteger class? When I say reasonable, I mean in keeping with how C++ handles these things. And is there a better way of getting that number than simply passing it to my string constructor?
tjm
Ignore what `cout` shows you, because that's based on all sorts of formatting options. What I suggested above (shifting into a huge array of bits) is logically what you need to do; I'll leave the actual implementation up to you. However, the frexp function (http://cplusplus.com/reference/clibrary/cmath/frexp/) will probably be of some use to you.
Oli Charlesworth
+4  A: 

double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).

In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.

The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.

My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.

To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.

When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.

Alf P. Steinbach
+1. Although I don't agree with your design choice. It goes against the idea of naturally extending C++ semantics. I can certainly assign a 32-bit float with a 24-bit mantissa to a 64-bit int if the value -- after truncation -- fits in 64 bits.
sellibitze
+1 from me too, and thank you for your response; it is helpful. Unfortunately, I really would prefer not to go down the route of artificially limiting the range of values: if C++ accepts it, I would like to accept it. I'm just not sure exactly what value I should be accepting!
tjm
`floor(abs(v))`, with sign as the original value `v`. The reason I suggested not accepting an "inexact" value is that the point of arbitrary or extended precision integer arithmetic is usually to have exact results. I should have added a weasel phrase "by default", I mean, why not just support both? :-)
Alf P. Steinbach
A: 

Yes, within IEEE formats, going from float to double to extended you can copy the bits from the smaller format into the larger format. For example:

single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEE MMMMM....

6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...

The mantissa is left-justified: copy its bits to the top of the wider mantissa field and pad with zeros on the right.

The exponent is right-justified: copy the msbit, fill the new bits in between with the complement of the msbit, then copy the remaining bits (in effect, sign-extending the bit below the msbit).

An exponent of -2, for example: take -2 and subtract 1, which is -3. -3 in two's complement is 0xFD, or 0b11111101, but the exponent bits in the single format are 0b01111101 -- the msbit inverted. For double, a -2 exponent is again -2 - 1 = -3, or 0b1111...1101, and that becomes 0b0111...1101, again the msbit inverted. (Exponent bits = twos_complement(exponent - 1) with the msbit inverted.)

As we saw above, for an exponent of 3: 3 - 1 = 2 = 0b000...010; invert the upper bit to get 0b100...010.

So yes, you can take the bits from the single-precision number and copy them to the proper locations in the double-precision number. I don't have an extended-float reference handy, but I'm pretty sure it works the same way.

dwelch
That is within a floating point standard like IEEE 754. If you want to convert from IEEE 754 to the TI DSP format, for example, it doesn't work that way -- you cannot copy the bits. Typically, though, within the same standard the various precisions extend the mantissa further to the right and the exponent further to the left, adding more precision without redefining how they work.
dwelch