tags:

views:

1557

answers:

5

Hi..

I'd like to compare some large objects representing trees, and to cache something so that I don't have to compare each new object against the ones that already exist...

The question is: what would be the best "something"? (A compromise between performance and collisions...)

On the one hand, I have a regular hashCode function based on the values of various fields (following Chapter 3 of Effective Java), but I'm not able to evaluate the potential collisions entailed by such an approach.

On the other hand, I have the MessageDigest approach from the standard Java distribution with the SHA-1 algorithm. I presume it's not going to be as efficient, but I may get fewer collisions. Am I right? Is it a correct solution in my context, or am I completely wrong?

The thing is that I don't know what the size of the objects will be. Please also note that the computed value is not going to be used in a HashTable.
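Just so it's clear what I mean by the second option, here is a minimal sketch of the MessageDigest idea, assuming the tree has already been turned into a byte[] somehow (that serialization step is exactly the part I'm unsure about):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public class Sha1Fingerprint {

        // Reduce an already-serialized tree (here just a byte[]) to a 20-byte SHA-1
        // digest that can be cached and compared instead of walking the whole tree.
        static byte[] fingerprint(byte[] serializedTree) throws NoSuchAlgorithmException {
            return MessageDigest.getInstance("SHA-1").digest(serializedTree);
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            byte[] a = "some tree state".getBytes(StandardCharsets.UTF_8);
            byte[] b = "some tree state".getBytes(StandardCharsets.UTF_8);
            // Equal inputs give equal digests, so only the digests need comparing.
            System.out.println(Arrays.equals(fingerprint(a), fingerprint(b)));   // true
        }
    }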

thx...

+6  A: 

Personally, I would use hashCode() for the objects until it's been proven that any possible collisions are an actual problem, so that you avoid preemptively optimizing a problem you might not actually have.
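For reference, an Effective Java (Chapter 3) style hashCode(), the kind the question refers to, looks roughly like this (the tree fields here are invented for illustration, and equals() would have to be kept consistent with it):

    import java.util.List;

    public final class TreeNode {
        private final String label;
        private final int weight;
        private final List<TreeNode> children;

        TreeNode(String label, int weight, List<TreeNode> children) {
            this.label = label;
            this.weight = weight;
            this.children = children;
        }

        @Override public int hashCode() {
            int result = 17;                                              // arbitrary non-zero seed
            result = 31 * result + (label == null ? 0 : label.hashCode());
            result = 31 * result + weight;
            result = 31 * result + (children == null ? 0 : children.hashCode());
            return result;
        }
    }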

matt b
Is there a way to evaluate the potential frequency/probability of collisions when using hashCode()?
LB
See Autocracy's link below; however, I don't really know the range of integers that Bloch's hashCode() implementation will return.
matt b
+6  A: 

Keep in mind the following:

  • Two objects may be unequal, yet have the same hash code
  • Your collision potential depends on how many objects you encounter
  • How useful hash codes will be depends on how you implement your checking

Generally, you can determine the chance of a collision based upon the number of expected objects and the number of possible hashes (max hash value). See http://en.wikipedia.org/wiki/Birthday_paradox for the detailed explanation.
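If you want to play with the numbers yourself, the usual birthday-bound approximation is easy to compute (a rough sketch, nothing more):

    public class BirthdayBound {

        // Approximate probability of at least one collision among n items drawn
        // uniformly at random from d possible hash values: p ~= 1 - exp(-n*n / (2*d)).
        static double collisionProbability(double n, double d) {
            return 1.0 - Math.exp(-(n * n) / (2.0 * d));
        }

        public static void main(String[] args) {
            // Roughly a 1% chance of a collision for 9,300 objects with a 32-bit hash,
            // matching the Wikipedia probability table mentioned below.
            System.out.println(collisionProbability(9300, Math.pow(2, 32)));
        }
    }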

Personally? Java objects (instantiated classes), fewer than 10,000 of them? Hash code. Representing files / blobs / lots of data? SHA-1. I use SHA-1 hashing in my database to keep people from doing ETL work on the same file more than once. I then use SHA-1 hashing again at a second level to keep people from ETLing the same section in more than one file (e.g., different files, but the same order shows up twice).
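As a sketch of the file-level SHA-1 idea (not my actual ETL code), the standard DigestInputStream does the streaming for you, so large files never have to fit in memory:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class FileSha1 {

        // Stream a file through SHA-1 and return the 20-byte digest.
        static byte[] sha1Of(String path) throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            try (InputStream in = new DigestInputStream(new FileInputStream(path), md)) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // reading through the DigestInputStream updates the digest as a side effect
                }
            }
            return md.digest();   // store this value to detect files that were loaded before
        }
    }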

Autocracy
Oh, and specifically http://en.wikipedia.org/wiki/Birthday_paradox#Probability_Table, which saves you the math and shows you have a 1% chance of collision for 9,300 objects (hashCode() returns a 32-bit integer).
Autocracy
+1  A: 

I'll endorse matt b's advice: "don't optimize before you need to optimize."

However, should you decide you need something more than the hash code down the road: I used message digests (MD5 in my case) to "uniquely" identify various items downloaded from RSS feeds, so that I didn't end up with the same item appearing many times in the list as I polled over and over. Those were typically small postings, so the digest could be calculated quickly. In my experience it was very effective and worked well.
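Roughly, the dedup looked like this (a simplified sketch, with the item text standing in for whatever you actually digest; the hex encoding is just to get a readable set key):

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.Set;

    public class FeedDeduper {
        private final Set<String> seen = new HashSet<String>();

        // Returns true the first time an item with this content is offered,
        // false if an item with the same MD5 digest has been seen before.
        boolean isNew(String itemContent) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(itemContent.getBytes(StandardCharsets.UTF_8));
            String key = new BigInteger(1, digest).toString(16);   // hex key for the set
            return seen.add(key);
        }
    }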

Since they are normally one-way functions meant to react strongly to even very small changes in the input data, you are definitely less likely to get collisions with MD5 or SHA-1.

John Munsch
+4  A: 

Because of the birthday problem, the chance of a collision depends on how many items you are working with.

The 160-bit space of SHA-1 is so large that I doubt you could ever have enough items to see a collision.

The 32-bit space of hashCode() should not have a significant number of collisions until you have over 50,000 items. However, this depends on using a good hash algorithm.

In order to apply a cryptographic digest like SHA-1, you'll need to convert your graph to a string of bytes, which is likely to be computationally expensive, and could be complicated.
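If the tree happens to be Serializable, standard Java serialization is one (not especially cheap) way to get those bytes; a rough sketch:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class GraphDigest {

        // Serialize the whole object graph to bytes, then digest those bytes.
        // Note: the serialized form must be stable for equal graphs; e.g. the
        // iteration order of any HashMap/HashSet fields is not guaranteed.
        static byte[] sha1Of(Serializable graph) throws IOException, NoSuchAlgorithmException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(graph);
            out.close();
            return MessageDigest.getInstance("SHA-1").digest(bytes.toByteArray());
        }
    }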

erickson
+3  A: 

Usually for duplicate file/data detection, MD5 is a good tradeoff between speed and chance of collision. MD5 is inappropriate if somebody could be deliberately crafting files to fool your program (it is slightly vulnerable to collision attacks). But if you're just worried about collisions by chance, then its 128-bit width is practically always sufficient at present.

SHA-1 and SHA-256 give you some protection against deliberate collision attacks (attacks on SHA-1 are theoretical; no practical ones are known; for keying data, it's rarely worth going beyond a 160-bit hash code width). SHA-1 is roughly half the speed of MD5.

Certainly if you use MD5, performance probably shouldn't be too much of an issue. But obviously this does depend on the size of your data. You may be interested in some information I put together about performance of secure hash functions in Java.
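If you want a ballpark figure for your own data, a crude timing loop like the following will do (not a proper benchmark; JIT warm-up, data size and hardware all matter):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Random;

    public class DigestTiming {
        public static void main(String[] args) throws NoSuchAlgorithmException {
            byte[] data = new byte[1024 * 1024];          // 1 MB of pseudo-random input
            new Random(42).nextBytes(data);
            for (String algo : new String[] {"MD5", "SHA-1", "SHA-256"}) {
                MessageDigest md = MessageDigest.getInstance(algo);
                long start = System.nanoTime();
                for (int i = 0; i < 100; i++) {
                    md.digest(data);                      // digest() also resets the instance
                }
                long elapsedMs = (System.nanoTime() - start) / 1000000L;
                System.out.println(algo + ": " + elapsedMs + " ms for 100 x 1 MB");
            }
        }
    }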

If you really do need something faster and you're only dealing with a few million items of data, then another option to consider is the 64-bit hash algorithm proposed by the Numerical Recipes authors.
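I don't have the Numerical Recipes code to hand, but to show the general shape of a 64-bit hash, here is 64-bit FNV-1a (a different, well-known algorithm, not the one from the book):

    public class Fnv64 {
        private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
        private static final long FNV_PRIME = 0x100000001b3L;

        // 64-bit FNV-1a: a cheap non-cryptographic hash with a 2^64 value space,
        // so chance collisions only become likely around a few billion items.
        static long hash(byte[] data) {
            long h = FNV_OFFSET_BASIS;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= FNV_PRIME;
            }
            return h;
        }
    }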

Java's standard hashCode() implementation (of, say, String) is probably not suitable: aside from any issues about the quality of the hash, its 32-bit width means that you'll expect a collision after just 16,000 items or so.

Neil Coffey