Floats, doubles and half floats | ansaurus

+1 Q:

Floats, doubles and half floats

I was wondering how the bits are organized in floats (4 bytes), doubles (8 bytes) and half floats (2 bytes, used in OpenGL implementations).

Further, how could I convert from one to another?

A:

Half, Single, Double

Handy-dandy diagrams on those pages. The library should provide means for converting between the various formats.

Anon. 2010-02-15 23:02:51

+2 A:

In essence for each of these formats, you have:

1 sign bit
x exponent bits yielding a whole number E
y mantissa (or "significand") bits yielding a fractional number M

If the sign bit is 1, the number is negative, else it is positive.

To get the magnitude of an ordinary (normalized) value, you take (1 + M) * 2^(E - k), where k (called the "exponent bias") depends on the format.

It's worth noting that certain combinations of sign, exponent, and mantissa are "special" values, like 0, -inf, +inf, NaN, and denormals (subnormals); these do not follow the formula above.

For the specifics (values of x, y, and k) see Wikipedia for single precision (4 bytes), double precision (8 bytes), and half precision (2 bytes).
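To make the formula concrete, here is a minimal Python sketch that decodes a raw bit pattern by hand. The function name decode and the FORMATS table are made up for this illustration; the (x, y, k) values in the table are the standard IEEE 754 ones (5/10/15 for half, 8/23/127 for single, 11/52/1023 for double). The special cases are handled the way IEEE 754 defines them, which goes a bit beyond the plain formula.

```python
# (exponent bits x, mantissa bits y, bias k) for each IEEE 754 binary format
FORMATS = {
    "half": (5, 10, 15),
    "single": (8, 23, 127),
    "double": (11, 52, 1023),
}

def decode(bits, fmt):
    """Decode an integer bit pattern into a Python float (illustrative helper)."""
    x, y, k = FORMATS[fmt]
    sign = (bits >> (x + y)) & 1             # top bit
    e = (bits >> y) & ((1 << x) - 1)         # exponent field as a whole number E
    m = (bits & ((1 << y) - 1)) / (1 << y)   # mantissa field as a fraction M in [0, 1)
    if e == (1 << x) - 1:
        # All-ones exponent: infinity if M == 0, otherwise NaN
        value = float("inf") if m == 0 else float("nan")
    elif e == 0:
        # All-zero exponent: zero or a denormal, no implicit leading 1
        value = m * 2.0 ** (1 - k)
    else:
        # Normal case: (1 + M) * 2^(E - k)
        value = (1 + m) * 2.0 ** (e - k)
    return -value if sign else value

print(decode(0x3C00, "half"))               # half-precision pattern for 1.0
print(decode(0x40490FDB, "single"))         # single-precision approximation of pi
print(decode(0x3FF8000000000000, "double")) # double-precision pattern for 1.5
```

The result is always a Python float (a double), which can hold every half and single value exactly, so nothing is lost in the decoding itself.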

Note that these are all specified by IEEE 754, so googling that might give you helpful results. :)
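As for converting from one format to another: going to a wider format is exact, and going to a narrower one rounds to the nearest representable value. Here is a small sketch using Python's standard struct module, whose format codes 'e', 'f' and 'd' stand for half, single and double (the 'e' code needs Python 3.6 or later). The helper names to_bits and convert are invented for this example, and other languages or libraries will have their own conversion routines.

```python
import struct

def to_bits(value, code):
    """Return the bit pattern of value stored as 'e' (half), 'f' (single) or 'd' (double)."""
    return int.from_bytes(struct.pack(">" + code, value), "big")

def convert(value, code):
    """Round value to the given format and return the result as a Python float."""
    return struct.unpack(">" + code, struct.pack(">" + code, value))[0]

print(hex(to_bits(1.0, "e")))   # 0x3c00
print(hex(to_bits(1.0, "f")))   # 0x3f800000
print(hex(to_bits(1.0, "d")))   # 0x3ff0000000000000

print(convert(21.4, "f"))       # double -> single loses precision
print(convert(0.1, "e"))        # double -> half loses much more
print(convert(convert(0.1, "e"), "d") == convert(0.1, "e"))  # half -> double is exact
```

Note that struct.pack raises OverflowError when a finite value is too large for 'e' or 'f' instead of returning infinity, so a real converter would need to decide how to handle that case.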

Sapph 2010-02-15 23:10:27

GPUs may not be fully conformant to IEEE 754 (for example, they frequently omit support for denormals).

Spudd86 2010-06-14 16:14:10
