views:

729

answers:

6

I am well aware of all the problems involved in comparing floats. This is exactly the reason for this question.
I'm looking to create a fast hash table for values that are 3D vectors (3 floats - x,y,z). It can be assumed that the length of the vector is always 1.0 (sqrt(x*x+y*y+z*z) is 1.0)

Essentially this means that I'm looking for a hash function that takes values that are almost equal to the same unsigned int value and a corresponding equality operator that is true if the hash values are equal (not not necessarily only if they are equal)

Edit -
False positives (i.e. vectors that are different but map to the same bucket) are a given since this is a hash table.
False negatives (i.e. Vectors that are close but map to different buckets) are undesirable but it seems that there is no way to avoid them. In my case, they will not causes total breakage, just some data duplication which is something I'll have to live with.

+2  A: 

I'd convert the float values into integers like this:

unsigned int IntValue = (int)(floatValue * MULT) + MULT;

so you get some of the first digits and then use

const MULT1 = (MULT << 1) + 1;
unsigned long long HashValue = (xIntValue * MULT1  * MULT1) + (yIntValue * MULT1) + zIntValue;

as a hash value (using (MULT * 2) + 1 because the IntValues will be between 0 and MULT * 2 inclusive).

The memory needed will be depending on the multiplicator MULT. For example, using 32 you'll get a hashtable using 64 * 64 * 64 * (Hash item size) = 262144 * (Hash item size) bytes.

schnaader
Just corrected the formula to support negative values, too.
schnaader
Using this scheme, you'd still get vectors that are very very close together but hash to different buckets - right at the edge of the rounding you're doing to compute IntValue.
Jesse Rusak
Of course, but I think the OP wants a quick way to compare vectors, not an exact way, or am I wrong?
schnaader
I think (and please correct me, if I'm wrong), that he wants a way that will only have false positives (saying they're equal when they're far away), not false negatives (saying they're different when they're close). At least, it sounds that way because he wants a hash table.
Jesse Rusak
Ah, now I get your point. You can't specify some "minimum same bucket distance" for my method (f.e. say all vectors that only differ by 1e-5 get into the same bucket).
schnaader
Yes, right. Much more clearly said than I managed. :)
Jesse Rusak
+3  A: 

I think what you're looking for is not directly possible. One important property of equality is that it is transitive. (ie. If a == b and b == c, then a == c). With a distance measure, though, you really don't want this property. Example:

Take a single float (for simplicity). Suppose we want to hash each float so that floats less than 1e-3 away are the same value. Now, suppose we add to this hash table 1000 float values all 1e-4 apart. Any neighboring 2 values should hash to the same float, since they are closer than 1e-3. However, because of transitivity, the neighbors of those values should also has to the same value, and their neighbors and so on. As a result, all 1000 values, including pairs farther than 1e-3 apart, would hash to the same integer. If you were to draw these points on a picture:

A  B  C  D  E  F  G  H ... Y Z

Suppose all the gaps are < 1e-3 apart, but A and Z are > 1e-3 apart (not to scale!). This can't be satisfied because if hash(A) == hash(B) and hash(B) == hash(C) and so on for all pairs, (since they are < 1e-3 apart) than hash(A) must == hash(Z).

One possible option is to define regions of your vector space in which all vectors would hash to the same value (ie round them before hashing them), but you could still get 2 vectors on the edges of their respective spaces that are close together but hash to a different value. You could get around that by searching all neighboring spaces for a vector. (ie in the 1-d case above, you'd round all vectors to the nearest multiple of 1e-3, and then search the neighbors, so 5.3e-3 would search 5e-3, 4e-3 and 6-e3. In higher-dimensional cases, you'd have to search neighbors in all dimensions.)

Jesse Rusak
This is an excellent point. thank you.
shoosh
A: 

Some languages (C, Java 5) allow you to access the binary value of your floats. This way, you can extract the first N bits of the mantissa (ignoring the last few bits which cause the trouble during compare) and calculate the hash from that.

Aaron Digulla
A: 

Can you ellaborate on your problem?

Assuming that you are using a hashmap to map some additional data to specific vectors, you could just use the XOR of the binary representations of the components (if this is possible in your language of choice). Then use as many LSBs (to reduce collisions) as you need for the hash map. This would of course have the property that two equal (by floating point comparison) vectors might not have the same hash (e.g. IEEE floating point 0 equals to -0, but they have different sign bit).

However, if you are planning on using vectors that are results of different computations to do hash lookup, you are setting yourself up to the possibility of not having matching hash codes due to rounding errors and you should probably be using something else anyway.

TrayMan
A: 

don't know how fast this might be, but since you have unit vectors, they all lie on the surface of a sphere. convert to a http://en.wikipedia.org/wiki/Spherical_coordinate_system. then use phi and theta to choose a bucket. there will be no false positives. you can look in the neighboring cells for false negatives.

Ray Tayek
Performing the conversion will introduce more rounding error. This might lead to some vectors ending up in the wrong bucket, depending on the bucket size.
Jørgen Fogh
A: 

Do you need it to be a fast hash table or would a tree structure do?

It seems to me that it would be easier to look up matching floats in a search tree of some sort. A B-Tree minimizes the number of cache misses, assuming that you choose the right node size. That ought to make it pretty fast in practice.

Jørgen Fogh