
What's the best heuristic I can use to identify whether a chunk of X 4-byte values holds integers or floats? A human can do this easily, but I wanted to do it programmatically.

I realize that since every combination of bits will result in a valid integer and (almost?) all of them will also result in a valid float, there is no way to know for sure. But I still would like to identify the most likely candidate (which will virtually always be correct; or at least, a human can do it).

For example, let's take a series of 4-bytes raw data and print them as integers first and then as floats:

1           1.4013e-45
10          1.4013e-44
44          6.16571e-44
5000        7.00649e-42
1024        1.43493e-42
0           0
0           0
-5          -nan
11          1.54143e-44

Obviously they will be integers.

Now, another example:

1065353216  1
1084227584  5
1085276160  5.5
1068149391  1.33333
1083179008  4.5
1120403456  100
0           0
-1110651699 -0.1
1195593728  50000

These will obviously be floats.

PS: I'm using C++ but you can answer in any language, pseudo code or just in english.

+1  A: 

You can probably "detect" it by looking at the high bits: with floats they'd generally be non-zero, while with integers they would be zero unless you're dealing with a very large number. So you could test whether `(2^30) & number` returns 0 or not.
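A minimal sketch of that bit test (the function name is mine). Note it is crude: `1.0f` has an exponent field of 127, so its bit 30 is clear and it would be misclassified.

```cpp
#include <cstdint>

// Sketch of the bit-30 heuristic suggested above: floats with magnitude
// >= 2.0 have the top exponent bit set, while small integers leave it clear.
// Caveat: 1.0f (exponent field 127) is a false negative for this test.
bool looksLikeFloat(uint32_t bits) {
    return (bits & (1u << 30)) != 0;  // test bit 30, the top exponent bit
}
```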

WoLpH
A: 

You are going to be looking at the upper 8 or 9 bits. That's where the sign and exponent of a floating point value are. Values of 0x00, 0x80 and 0xFF here are pretty uncommon for valid float data.

In particular, if the upper 9 bits are all 0 then this is likely to be a valid floating point value only if all 32 bits are 0. Another way to say this is that if the exponent is 0, the mantissa should also be zero. If the upper bit is 1 and the next 8 bits are 0, this is legal, but also not likely to be valid. It represents -0.0, which is a legal floating point value, but rarely a meaningful one.

To put this into numerical terms: if the upper byte is 0x00 (or 0x80), then the value has a magnitude of at most 2.35e-38. Planck's constant is 6.62e-34 m²·kg/s, four orders of magnitude larger. The estimated diameter of a proton is much, much larger than that (about 1.6e-15 meters). The smallest non-zero value for audio data is about 2.3e-10. You aren't likely to see floating point values that are legitimate measurements of anything real that are smaller than 2.35e-38 but not zero.

Going the other direction, if the upper byte is 0xFF then this value is either infinite, a NaN, or larger in magnitude than 3.4e+38. The age of the universe is estimated to be 1.3e+10 years (about 4e+32 femtoseconds). The observable universe has roughly e+23 stars, and Avogadro's number is 6.02e+23. Once again, float values larger than e+38 rarely show up in legitimate measurements.

This is not to say that the FPU can't load or produce such values, and you will certainly see them in intermediate values of calculations if you are working with modern FPUs. A modern FPU will load a floating point value that has an exponent of 0 even when the other bits are not 0. These are called denormalized values. This is why you are seeing small positive integers show up as float values in the range of e-42, even though the normal range of a float only goes down to e-38.
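That behavior is easy to demonstrate (this sketch assumes IEEE 754 single precision; the helper names are mine):

```cpp
#include <cfloat>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret an integer's bit pattern as a float, bit for bit.
float reinterpretAsFloat(int32_t i) {
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}

// Small positive integers land in the subnormal (denormalized) float range.
bool isSubnormal(int32_t i) {
    return std::fpclassify(reinterpretAsFloat(i)) == FP_SUBNORMAL;
}
```

For example, the bit pattern 11 reads back as roughly 1.54e-44, well below FLT_MIN (about 1.18e-38).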

An exponent of all 1s with a mantissa of 0 represents Infinity. You probably won't find infinities in your data, but you would know better than I. -Infinity is 0xFF800000, +Infinity is 0x7F800000; any value other than 0 in the mantissa field alongside an all-1s exponent encodes a NaN.

Loading a NaN into a float register can cause it to throw an exception, so you want to use integer math to do your guessing about whether your data is float or int, at least until you are fairly certain it is float.
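An integer-only version of this check might look like the following sketch (the function name is mine); it never loads the bits into a float register:

```cpp
#include <cstdint>

// Inspect the exponent field with pure integer operations. A zero exponent
// with a nonzero mantissa (denormal) or an all-1s exponent (Inf/NaN) is
// unlikely in real float data, suggesting the bits are really an integer.
bool suspiciousAsFloat(uint32_t bits) {
    uint32_t exp  = (bits >> 23) & 0xFFu;  // 8-bit exponent field
    uint32_t mant = bits & 0x007FFFFFu;    // 23-bit mantissa
    if (exp == 0x00u && mant != 0)  // denormal: rare in legitimate data
        return true;
    if (exp == 0xFFu)               // Infinity or NaN
        return true;
    return false;
}
```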

John Knoeller
You are completely wrong on every assertion I see here. The sign+exponent are contained in the upper *9* bits. 0x00 is always valid; if the next bit is 0 then the number is denormalized. 0x80 indicates a small negative value. 0xff precedes any of a large negative value, negative infinity, or NAN.
Potatoswatter
`0x80` is absolutely valid as the upper byte of a float (specifically, it's the upper byte of negative zero or a very small negative number). More generally, **every** 32 bit integer is a valid float encoding (some are NaNs, but those are still valid encodings). `0xFF` is actually the high byte of a very *large* negative number (or -infinity, or NaN). `0x00` is a valid upper byte too (of zero or a small positive number).
Stephen Canon
@Potatoswatter: You're right - what I said was more heuristic than technical. It is the upper 9 bits, and denormalized floats are legal values that are tolerated by the FPU. But they aren't _normal_ and thus can be used as a reasonableness check. In light of your objections, I expanded the answer.
John Knoeller
@Stephen Canon: It depends on what you mean by valid. What I mean is _reasonable to find in actual data_. I'll grant you that every 32 bit value has a defined meaning to the FPU, but some bit patterns don't show up in floating point values that have come _out_ of the FPU, and others are unlikely to appear in data that has been generated by some process other than a random number generator.
John Knoeller
I'm sorry, but this is still incorrect. (Most) FPUs can produce denormalized values as the result of arithmetic or conversions, so it is entirely possible to get denormalized values "out of the FPU". More generally, the word *valid* has a common meaning in English, and it isn't "expected". While some FPUs do not generate some bit patterns (generally a subset of the possible NaN encodings), there is no standard that guarantees this will be true for all FPUs (and indeed, it isn't).
Stephen Canon
@Stephen Canon: I did some research and you are right, most modern FPUs will generate denormalized values for values very near 0 that can't be represented by normalized values. I still think checking for 0 in the exponent is a valid heuristic, but I retract my statement about not being able to get those values out of the FPU.
John Knoeller
I agree that it's a valid heuristic (if perhaps a bit overbroad). I was only trying to correct some misstatements about floating point. I think the general idea of your suggestion is on target.
Stephen Canon
@John: there is also a problem with the overall reasoning. When you do engineering computing using matrices, or many numerical algorithms, you can get valid float numbers with very small exponents. What it means is not that this is really an integer, but that algorithms used in numerical computing have built-in errors. Hence it seems very dangerous to rely on the above reasoning for any value that is the result of some computation rather than data entered by a user from the real world.
kriss
@kriss: it seems unlikely that subnormal values will be larger than your noise level. Just because you get a number out of a calculation doesn't mean that the number is _valid_ in an engineering sense. But if you know for certain that you have valid data down in the e-40 range, then you can't use _any_ heuristic, so this discussion wouldn't apply to you.
John Knoeller
@John: I agree with you. I'm just saying that some numerical algorithms give results within the noise level and that's OK. In such cases the above reasoning on limit values doesn't hold. Some numerical algorithms also work with non-physical values (like when zooming on fractal images, or computing with more than 3 dimensions). The above reasoning supposes the values you have are counts of something from the real world, and I find that a very restrictive hypothesis. Basically, the above reasoning states that IEEE float numbers reserved too many bits to encode exponents, and I can't agree with that.
kriss
@John: Another thing that is bothering me is that it's only half a reasoning: it only checks whether the value is a plausible float and forgets that the other tested alternative is that it is an integer.
kriss
+9  A: 

The "common sense" heuristic from your example seems to basically amount to a range check. If one interpretation is very large (or a tiny fraction, close to zero), that is probably wrong. Check the exponent of the float interpretation and compare it to the exponent that results from a proper static cast of the integer interpretation to a float.

Alan
This is safe if you do integer comparisons. If you do float comparisons you risk loading a NaN and either getting an exception or unexpected results from your compare operations.
John Knoeller
If you want to compare only the exponents, then you need to mask out the bits and compare as an integer. Float comparison would not be involved.
Alan
+1  A: 

If both numbers are positive, your floats are reasonably large (at least the smallest normalized float, about 1.2*10^-38), and your ints are reasonably small (less than 8*10^6), then the check is pretty simple. Treat the data as a float and compare to the smallest normalized float.

#include <cstdint>
#include <limits>

union float_or_int {
    float f;
    int32_t i;
};

bool is_positive_normalized_float( float_or_int &u ) {
    return u.f >= std::numeric_limits<float>::min();
}

This assumes IEEE float and the same endianness between the CPU and the FPU.

Potatoswatter
A: 

If you know that your floats are all going to be actual values (no NaNs, INFs, denormals or other aberrant values) then you can use this as a criterion. In general, an array of ints will have a high probability of containing "bad" float values.
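A sketch of this criterion over a whole buffer (the function name is mine; it assumes IEEE 754 floats): count how many words classify as NaN, infinity, or subnormal when read as floats. Typical integer data produces many such values; clean float data produces none.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Count "bad" float interpretations (NaN, Inf, subnormal) in a buffer.
// A high count suggests the buffer actually holds integers.
std::size_t countBadFloats(const uint32_t* data, std::size_t n) {
    std::size_t bad = 0;
    for (std::size_t k = 0; k < n; ++k) {
        float f;
        std::memcpy(&f, &data[k], sizeof f);
        int cls = std::fpclassify(f);
        if (cls == FP_NAN || cls == FP_INFINITE || cls == FP_SUBNORMAL)
            ++bad;
    }
    return bad;
}
```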

Paul R
+2  A: 

A human can do this easily

A human can't do it at all. Ergo neither can a computer. There are 2^32 valid int values. A large number of them are also valid float values. There is no way of distinguishing the intent of the data other than by tagging it or by not getting into such a mess in the first place.

Don't attempt this.

EJP
+3  A: 

Looks like a Kolmogorov complexity issue. Basically, from your examples, the interpretation with the shorter printed form (as a string to be read by a human), be it integer or float, is the right answer for your heuristic.

Also, obviously, if the value is not a valid float, it is an integer :-)

Seems direct enough to implement.
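One way this might be implemented (the function name and formats are mine): print the bits both ways and prefer the interpretation with the shorter decimal representation.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Print the same 32 bits as an int ("%d") and as a float ("%g"), and prefer
// whichever interpretation has the shorter human-readable form.
bool shorterAsFloat(uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);

    char intBuf[32], floatBuf[32];
    int lenInt   = std::snprintf(intBuf, sizeof intBuf, "%d",
                                 static_cast<int32_t>(bits));
    int lenFloat = std::snprintf(floatBuf, sizeof floatBuf, "%g", f);
    return lenFloat < lenInt;
}
```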

kriss
A: 

I assume the following:

  • that you mean IEEE 754 single precision floating point numbers.
  • that the sign bit of the float is saved in the MSB of an int.

So here we go:

#include <cassert>
#include <cstdint>

bool probablyFloat(uint32_t bits) {
  int exp = static_cast<int>((bits & 0x7f800000U) >> 23) - 127;
  uint32_t mant = bits & 0x007fffffU;

  // +- 0.0
  if (exp == -127 && mant == 0)
    return true;

  // +- 1 billionth to 1 billion
  if (-30 <= exp && exp <= 30)
    return true;

  // some value with only a few binary digits
  if ((mant & 0x0000ffff) == 0)
    return true;

  return false;
}

int main() {
  assert(probablyFloat(1065353216));
  assert(probablyFloat(1084227584));
  assert(probablyFloat(1085276160));
  assert(probablyFloat(1068149391));
  assert(probablyFloat(1083179008));
  assert(probablyFloat(1120403456));
  assert(probablyFloat(0));
  assert(probablyFloat(-1110651699));
  assert(probablyFloat(1195593728));
  return 0;
}
Roland Illig
A: 

Simplifying what Alan said, I'd ONLY look at the integer form, and say: if the number is bigger than 99999999, then it's almost definitely a float.

This has the advantage that it's fast, easy, and avoids nan issues.

It has the disadvantage that it's pretty much full of crap... I didn't actually look at what floats these would represent or anything, but it looks reasonable from your examples...

In any case, this is a heuristic, so it's GONNA be full of crap, and not always work anyway...

Measure with a micrometer, mark with chalk, cut with an axe.
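As a one-liner, the check above might look like this sketch (the 99999999 cutoff is the one stated above, and admittedly arbitrary; the function name is mine):

```cpp
#include <cstdint>
#include <cstdlib>

// If the integer reading of the bits is very large in magnitude, guess float.
bool probablyFloatSimple(int32_t i) {
    return std::abs(static_cast<long long>(i)) > 99999999LL;
}
```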

Brian Postow