ansaurus

Question

Generate identical Hashcodes for approximately-similar numbers

Answer 1

A:

Here's some code to show what I'm doing. Each pair of numbers in "original" should return the same value.

int tolerance = 3;
double[] original = new double[] {
0.69912839,
0.69990821,

0.69990821,
0.70000120,

12.980984087,
12.981808908
};
double[] modified = new double[original.Length];

for (int i = 0; i < original.Length; i++)
{
modified[i] = original[i];

/* Begin number adjustment logic */
modified[i] *= Math.Pow(10, tolerance);
modified[i] = Math.Truncate(modified[i]);

if (modified[i] % 2 != 0)
{
modified[i]++;
}
/* End number adjustment logic */

Console.WriteLine(modified[i]);

if (i % 2 != 0)
{
Console.WriteLine(string.Empty);
}
}

That above method is the "truncate to 3dp then adjust to the nearest even" method. Next is the truncate method (replace code between the begin/end comments):

/* Begin number adjustment logic */
modified[i] *= Math.Pow(10, tolerance);
modified[i] = Math.Truncate(modified[i]);
/* End number adjustment logic */

This is the round method:

/* Begin number adjustment logic */
modified[i] = Math.Round(modified[i], tolerance);
/* End number adjustment logic */

Andy Shellam 2010-02-12 09:18:02

Answer 2

+3 A:

It's unable to write correct hashcode. let proof it: we have 2 points. var a = point1.GetHashCode(); var b = point2.GetHashCode();

if a!= b let create point between point1 and point2. and so on.

After such operations we will create line, where each point is near to some other point and their hashcodes will be the same. So hash code for point1 and point2 should be equals.

So overrtide like this:

public override int GetHashCode()
{
    return 0;
}

and implement you equals.

Steck 2010-02-12 09:23:02

I think your conclusion is right, although I'm not sure I follow your logic. The only way to do this is to make ALL hashcodes return the same and rely on the subsequent call to Equals that will be done to determine if they are the same. This obviously means you can't use the class in a hashmap properly as they'll all end up in one bucket but may not be an issue.

Paolo 2010-02-12 09:27:39

These won't be used in a hashtable or dictionary as there's nothing unique about the locations to be keyed on - all I need is a list of them and work out which ones are at the same point on one or both axis.

Andy Shellam 2010-02-12 09:32:13

This worked fine but like Paolo, I didn't follow your logic, therefore I've marked fortran's answer as it made more sense to me. Thanks.

Andy Shellam 2010-02-12 09:41:52

Yeah, if we have to get the same hashcode to the near points we can't give different hashcodes and using hashmap is useless.Try to use something like k-means clustering and give hashcodes to the clusters.

Steck 2010-02-12 09:42:26

Answer 3

+1 A:

I guess that if you return always the same hash (say 0), LinQ will try to compare all elements with equals. After all, a hash is useful to prove that two elements are distinct, not equal.

But anyway, I would recommend you to use better suited structures and algorithms for this domain, like Binary Splitting Partition (BSP) trees (for example).

fortran 2010-02-12 09:23:51

Interesting - I was following the guidelines on GetHashcode implementations that state it must return the same hash if Equals() returns true, which I guess this is.

Andy Shellam 2010-02-12 09:30:38

This works great, thanks!

Andy Shellam 2010-02-12 09:44:04

that's the point :-) the semantics imply this syllogism: `A == B => hash(A) == hash(A)`, so if we got just the hashes we can deduce `!(hash(A) == hash(B)) => !(A == B)`, but not the complementary.

fortran 2010-02-12 09:45:53

Answer 4

+3 A:

I think you should not override Equals(), ==, != or GetHashCode()

If you override any of these you should make sure that their semantics do not change. In your example they do.

For instance your == cannot be transitive for it it is then if P1 is 0.001 mm from P2, P2 is 0.001 mm from P3 and P1 is 0.002 mm from P3 then P1 == P2, P2 == P3 and P1 == P3, which is not what you want. In general you end up with all points being equal to all other points.

I would just stick to using a separate method for determining if points are close enough.

EDIT

With your override of == you can now write code like this:

if(P1 == P2 && P2 == P3 && P1 != P3)
{
    // Code here gets executed
}

Peter van der Heijden 2010-02-12 09:37:05

If P1 is 0.002mm from P3, then when comparing P1 to P3, it'll return false. It'll only ever be comparing 2 points and is to check - "actually, are these two points close enough to be the same?"

Andy Shellam 2010-02-12 09:54:26

I have added code to try to clarify what I think is wrong with your overloaded == operator

Peter van der Heijden 2010-02-12 11:15:18

I don't need to compare 3 points though. Take 2 lines - one starts at the same point the other ends. Even though the points look the same, they're actually in the region of 0.00081mm apart, for example. So I just want to look at those 2 points and say "are they within 0.001mm of each other?" If they are, then they're taken as the same point and the lines (in my app) are joined. The overloaded == operator is working great, it's just getting the same hashcode for two points I consider the same that was making things tricky.

Andy Shellam 2010-02-12 13:08:28

I appreciate that what you have done satisfies your current requirements. However by creating a class that uses well known operators and methods in a manner that is not standard you run several risks. Other people will have a hard time understanding your code or using your class, you will have a hard time using your class in standard scenario's e.g. as key in a dictionary, if the code that uses your class grows in size you may run into hard to find bugs, etc.

Peter van der Heijden 2010-02-12 15:32:38

Answer 5

+3 A:

It would be easier just to remove the dependency on the Distinct method. Implement a System.Collections.IComparer (or the generic equivalent) and use a simple collection like a list. Then determine if the item is in the list with the comparer, and not add it if it already is contained.

MaLio 2010-02-12 09:45:22

I've also marked this as an answer as it actually seems more logical - don't know why I didn't think of it before! I had to use the List's BinarySearch method to be able to use IComparer - the Contains method requires an IEqualityComparer which guess what? It uses a hashcode!! I've implemented my own Distinct() method in my collection which is what I was leaning toward doing anyway.

Andy Shellam 2010-02-12 10:51:44

Answer 6

+1 A:

should I scrap the Equals, ==, != and GetHashcode overrides, and create my own MyPoint.IsEqualTo() and MyPointCollection.Distinct() methods?

Yes.

It needn't necessarily be some completely different data structure, though. In checking for a duplicate, you need to check the neighboring hashcodes, e.g. the hashes for (x+0.001,y), (x,y-0.001), and so on. This makes for just a constant-factor slowdown compared to the usual deduplication lookup, and it's not complicated, so it might be the way to go. (An obvious point, but I don't see it made explicitly here yet.)

Update: To clarify, let's look at a 1-dimensional version of the problem. The 'points' are single numbers, x. We consider x1 and x2 to match when abs(x1 - x2) < .001. Now we want to find out if x matches any of {x_0, ..., x_n}. The x_i are kept in a hashtable, where hash(x) = h(floor(1000*x)) for some function h() that spreads things around. To see if x is already in the table, we compute hash(x-.001), hash(x), and hash(x+.001), then we test if x matches any of the x_i in any of the three buckets. Any matching x_i cannot be in any other bucket.

In the 2-d variant, there are 9 neighboring buckets to check (counting the middle); in 3-d, 27.

Darius Bacon 2010-02-12 10:43:24

This was similar to what I was considering doing but I still needed a constant number that was the same to generate the hash on. For example, 0.69890821 and 0.7000821 (0.69990821 +/- 0.001) will still generate completely different hashes to 0.69912839 +/- 0.001 even though I must consider them to be the same point.

Andy Shellam 2010-02-12 11:00:36

I think I didn't make myself clear; I'll update my answer.

Darius Bacon 2010-02-12 21:58:58

Answer 7

+1 A:

This should be a clearer explanation of what Steck and Paolo said.

Suppose you did manage to write the GetHashCode method in the way you want.

Then for any points a and b, whatever the distance between them, a.GetHashCode() == b.GetHashCode().

Proof: suppose a < b. Split the distance between a and b into segments smaller than 0.001. I.e. a0 = a, a1 = a0 + 0.0005, a2 = a1 + 0.0005, etc. until you get to b.

Then a.GetHashCode() == a1.GetHashCode() == a2.GetHashCode() == ... == b.GetHashCode().

Alexey Romanov 2010-02-12 14:36:28

ansaurus

tags:

views:

answers:

Generate identical Hashcodes for approximately-similar numbers

related questions