views:

62

answers:

4

I need to store pairs of float,int in which the int value stores the number of occurrences of a float value inside a model I'm using for a tool I'm developing and I was wondering if it's safe to do such things..

Finite precision should be an issue when talking floats used to direct comparisons (or as content to be hashed) so I think that a similar approach is discouraged, am I right?

Actually the problem is that I don't have any other information coupled with these floats so I simply can't use anything else as a key for the hashtable but in the same time, since keys will be many, having a good performance would be nice.

Maybe the best solution is to use a binary search tree (or an even more advanced data structure) to obtain at least an average case of O(logn) also if a constant factor would be better.

Do you have any suggestion? Just to let you know I'm developing in OCaml but I think that these considerations can be considered language agnostic

A: 

Are you talking about a performance problem, or a validity problem?

For validity: If you want to count occurrences of identical floats, then there's no problem. If you want to count occurrences of floats that are approximately the same, you need to figure out what 'approximately the same' means for you.

David M.
+2  A: 

I think there are a couple of questions here

Is it safe to use floats as keys to a hash table?

Yes. I can't think of a language right now where floats do not fit the requirements needed for a key in a hash table (typically stable hash code and equality semantics)

Is it OK to have a hash table with lots of keys?

Depends on how many. If the number of keys is so great it causes the table to expand past the allowable memory size then certainly no as it will cause out of memory situations. It's really impossible to answer this part of the question without more context. Likely you're the only one who will be able to answer it.

Does the precision of float make it any worse than other types like int?

This is implementation specific but I believe in OCaml a float has double precision (8 bytes). So asking if the precision makes it invalid as a key is the equivalent of asking is the C# long type unsuitable as a hash table key. They both have the same number of possible values (they're both 8 bytes). I would certainly say long is a valid key type (used it often and there is nothing wrong with it).

I think the real question is are you irresponsibly creating instances of the float to use as a key.

If I'm running out of memory with a hash table will a binary tree be any better?

Possibly but not by a lot. There is overhead involved with both binary trees and hash tables. For hashtables it's typically the unused buckets and the next pointers in the lists inside the buckets. For a binary tree every element in the tree has 2 extra pieces of overhead (the left and right pointer). If you're running out of memory I'm not sure swvitching to a binary tree will be significantly better.

JaredPar
+2  A: 

The usual problem with floating-point numbers is that calculations are approximate. If you calculate the same value in two different ways, the results are likely to be very slightly different. (In some cases, you can get slight differences by calculating the same value twice the same way.)

Therefore, if you're doing any calculations on floats, you will get approximations, and shouldn't rely on equality. If your source was calculating floats in various ways, the data passed to you will be approximate. If you are getting exact floating-point values, and can count on any numbers that should be the same being the exact same bit representation, then equality works as normal, and you can use a hash table.

David Thornley
A: 

You might be OK, if you're sure that you intent to count the number of instances of an exact float value.

As David says, the inherent problem of a hashtable keyed on floats is that hashtables use equality to identify keys, and equality of floats is a slightly unreliable concept due to calculation errors. There is no general guarantee that sin(pi / 6) == 0.5, or even that (2.0 / 3) * (2.0 / 3) == (4.0 / 9). In both cases, the LHS might be a bit or more different from the RHS.

So, if some of the entries you're counting are entered as 0.5, and some are computed as sin(pi / 6), and you want them to be counted together, then you need to do more than just hash on a float value.

You might get away with rounding and then hashing, although you never escape the issue entirely. For instance if you round to the nearest 0.001, then you'll identify 0.2020001 and 0.2020003 as being "the same value, with a calculation error", but not the equally-close-together 0.1014999 and 0.1015001. I've used base-10 examples for ease of typing, but of course "float" generally means a binary representation.

Exactly the same issue would apply with a binary tree. Hashtables don't really care what their key data "is", they just care that someone can provide a function h which maps keys to integers, such that for any x and y you want to consider "equal", h(x) == h(y). Then for performance, you want h to introduce no more "collisions" (instances of h(x) == h(y) where x != y) than random chance. There's no obstacle at all to doing this with floats. You must ensure that you don't include anything in the hash that doesn't participate in comparisons, and it helps if you include all the information that does participate in comparisons.

If you can solve the question of what it is you're actually counting, then that might lead you to the data structure you need. If you do want some tolerance in the matches, you might just be best off sorting all your floats and then looking for clusters of values.

Steve Jessop