ansaurus

Question

How to convert a float to a non standard encoding

Answer 1

+2 A:

Don't get carried away about the internal representation of the float. Fixed-point values are just integers with a constant scale factor. Just remember that a float has less precision than your target format: it carries only 24 significant bits, so for values near full scale the lowest 7 or so bits of the 32-bit result will not be meaningful.

// s15Fixed16Number is presumably typedef'ed to unsigned int; abs() needs <stdlib.h>
float foo = 1.0f;
int fooFixedSigned = (int)(foo * 65536);  // scale by 2^16, truncating toward zero
s15Fixed16Number fooFixed = (s15Fixed16Number)(abs(fooFixedSigned));
if (foo < 0) fooFixed = fooFixed | (1u << 31);  // 1u: shifting a signed 1 into bit 31 is undefined
// You'll also need to explicitly check for overflows and underflows and handle them however is appropriate to your situation
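For what it's worth, here is one way the overflow check mentioned above could look for the explicit-sign-bit reading of the format. This is only a sketch: the function name `float_to_signmag16` and the choice to report failure through a `bool` are my own assumptions, not anything from the spec, and the comments below argue that the spec actually means two's complement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes f as a sign bit plus a 31-bit magnitude
   with 16 fraction bits. Returns false if f is NaN or out of range. */
bool float_to_signmag16(double f, uint32_t *out)
{
    assert(out != NULL);
    double scaled = (f < 0.0 ? -f : f) * 65536.0;  /* |f| * 2^16 */
    /* The magnitude must fit in 31 bits; written so NaN also fails. */
    if (!(scaled < 2147483648.0))
        return false;
    uint32_t mag = (uint32_t)scaled;               /* truncates toward zero */
    /* Set the sign bit only for a nonzero negative magnitude. */
    *out = (f < 0.0 && mag != 0) ? (mag | UINT32_C(0x80000000)) : mag;
    return true;
}
```

With this sketch, 1.0 encodes as 0x00010000 and -1.0 as 0x80010000, while 40000.0 is rejected instead of spilling into the sign bit.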

Edit: corrected & to |

Alan 2010-08-30 04:19:04

As Alan has shown, fixed point values can be converted to and from floating point values by multiplying or dividing by the unit value. This format throws a small twist by specifying a sign bit.

DominicMcDonnell 2010-08-30 04:58:16

You should be using `long` rather than `int` - the former has at least 32 bits, whereas the latter is only guaranteed to have 16.

caf 2010-08-30 05:19:52

Sorry, nope. Close, but nope. The representation explicitly says it is 2's complement. You don't get that by taking the absolute value. Further, your attempt to use a bit-wise AND operator to set the sign bit will clear every bit *except* the sign bit.

RBerteig 2010-08-30 05:33:19

The original question specified an explicit sign bit and not two's complement. But you're right about the incorrect bitwise operator, that was a mistake.

Alan 2010-08-30 19:35:42

The question sort of does, but the specification it references doesn't, and the sample values clearly aren't signed magnitude. The question misunderstood the spec. Further, if you want to convert to a signed magnitude form, then you might also need to guarantee that overflow in the multiply doesn't set the sign bit for an out of range positive number.

RBerteig 2010-08-31 08:05:03

Thanks for the answer. This definitely helped me understand what's going on here. @Alan, you're correct in suggesting that I shouldn't get too carried away with the representation of a float. Thanks for the clarification.

jonc 2010-08-31 15:37:20

Answer 2

+1 A:

Assuming your C environment does 2's complement integers, then this is much simpler than it seems.

typedef long s1516;  // 32bit 2's complement signed integer
s1516 floattos1516(double f) {
    return (s1516)(f * 65536. + 0.5);
}

The representation is a fixed point value, with 16 bits of fraction. That is the same as a rational number whose denominator is always 65536 (or 2¹⁶). To form such a rational from a floating point value, you just multiply by the denominator. Then it is just a matter of an appropriate rounding, and a truncation to the integral type.
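As a hedged illustration of "multiply, round, truncate": adding 0.5 before the cast only rounds correctly for non-negative inputs, because the cast truncates toward zero. The sketch below rounds half away from zero instead and adds a range check. The name `double_to_s15f16` and the `bool` return convention are assumptions of mine, and other rounding rules are equally defensible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts f to a two's complement fixed-point value
   with 16 fraction bits. Returns false if f is NaN or does not fit. */
bool double_to_s15f16(double f, int32_t *out)
{
    assert(out != NULL);
    double scaled = f * 65536.0;             /* multiply by the denominator */
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;  /* bias for round half away from zero */
    /* After the cast the result must lie in [-2^31, 2^31 - 1];
       the comparison is written so that NaN fails it. */
    if (!(scaled > -2147483649.0 && scaled < 2147483648.0))
        return false;
    *out = (int32_t)scaled;                  /* truncation completes the rounding */
    return true;
}
```

Under these assumptions -1.0 becomes -65536, which is 0xFFFF0000 when viewed as an unsigned 32-bit pattern, and 32768.0 is reported as out of range.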

The standard picked the form they did because this just works if your system uses 2's complement integer arithmetic. Although it is true that the leftmost bit does represent the sign, it is not a sign bit in the sense that is used in a floating point representation.

If your calculations are truly float rather than double, you will find that you don't have as much precision in your calculation as is available in the fixed point value for numbers near full scale. If you calculate in double, then you will always have more precision in your calculation than in the result.
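To make the precision point concrete, here is a small sketch (the function names are hypothetical). A float has 24 significant bits, so any float between 16384 and 32768 is a multiple of 2^-9. After scaling by 2^16 the result is a multiple of 128, meaning the low 7 bits of the fixed-point value are always zero. The same input held in a double keeps those bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rounds x * 2^16 half away from zero.
   Range checking is left to the caller, hence the assert. */
int32_t s15f16_round(double x)
{
    assert(x >= -32768.0 && x <= 32767.0 + 65535.0 / 65536.0);
    double scaled = x * 65536.0;
    return (int32_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Same conversion, but the input has already been squeezed through a float,
   so the precision was lost before the arithmetic even started. */
int32_t s15f16_via_float(float x)
{
    return s15f16_round((double)x);
}
```

For example, `s15f16_via_float(20000.123f)` has its low 7 bits clear, while `s15f16_round(20000.123)` does not. For small values such as 1.5 the two agree.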

Edit:

What appears to be the latest spec is available from the ICC as Specification ICC.1:2004-10 (Profile version 4.2.0.0). Section 5.1.3:

5.1.3 s15Fixed16Number

A fixed signed 4-byte/32-bit quantity which has 16 fractional bits as shown in table 3.

Table 3 — s15Fixed16Number
  Number               Encoding
-32768,0               80000000h
     0                 00000000h
     1,0               00010000h
 32767 + (65535/65536) 7FFFFFFFh

Aside from localized preference for the representation of a decimal point, these values are completely consistent with my understanding that the representation is simply signed 2's complement integers that should be divided by 65536 to get their values.

The natural conversion to the representation is simply to multiply by 65536, and from it simply to divide. Picking a suitable rounding rule is a matter of preference.

The full scale range is from -32768.0 (0x80000000) to approximately 32767.9999847412 (0x7fffffff), inclusive.
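Going the other way, the Table 3 values can be checked mechanically. This is a sketch under the two's complement reading argued for above. The names `s15f16_bits_to_double` and `check_table3` are mine, and the 64-bit intermediate is just one way to avoid an implementation-defined cast from unsigned to signed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: interprets a 32-bit pattern as a two's complement
   integer and divides by 65536 to recover the value. */
double s15f16_bits_to_double(uint32_t bits)
{
    int64_t v = (int64_t)bits;
    if (bits & UINT32_C(0x80000000))
        v -= INT64_C(4294967296);   /* subtract 2^32 to get the negative value */
    return (double)v / 65536.0;     /* exact: every such value fits in a double */
}

/* Checks the four encodings listed in Table 3 of the spec. */
void check_table3(void)
{
    assert(s15f16_bits_to_double(UINT32_C(0x80000000)) == -32768.0);
    assert(s15f16_bits_to_double(UINT32_C(0x00000000)) == 0.0);
    assert(s15f16_bits_to_double(UINT32_C(0x00010000)) == 1.0);
    assert(s15f16_bits_to_double(UINT32_C(0x7FFFFFFF)) == 32767.0 + 65535.0 / 65536.0);
}
```

All four table rows pass under this interpretation, and 0xFFFF0000 decodes to -1.0.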

I would agree that it would be clearer if the specification had happened to show the representation in hex of any negative values. I skimmed the entire document, and the only values I found represented in both decimal and hex were CIE XYZ chromaticity coordinates, which by definition range from 0 to 1, and hence don't help as exemplar negative values.

RBerteig 2010-08-30 05:31:42

Your code lacks the error checking to spot range problems when plugged into my test framework (as an extra column in the output). Also, the result is badly wrong when the input is negative (giving 0xFFFF0001 for -1.0). The rounding with +0.5 causes slight deviations from my answers, but yours may be better than mine because of it.

Jonathan Leffler 2010-08-30 05:53:58

-1.0 would be exactly 0xFFFF0000 assuming the encoding is what I understood it to be. Rounding up might not be the best answer, however. I should flag this particular fragment as untested, but I use fragments just like it to convert from floating point to fixed point regularly.

RBerteig 2010-08-30 06:32:41

@Jonathan, I think you are overthinking the spec. As I read it, it can only have the simple and natural meaning, and the quoted sample values are consistent.

RBerteig 2010-08-30 07:09:24

Thanks RBerteig. I selected this answer because I realized you are probably correct in your interpretation that the spec is talking about a fixed-point 2's complement number. I will probably use a double to make sure I get the precision I need, and maybe add some error checking.

jonc 2010-08-31 15:39:21
