tags:

views:

1257

answers:

6

Hi,

I would like to know what is the fastest way in java 1.5 to compare two data structures.

My data structure represents a tree that can be pretty big. I can traverse the whole data structure and compare the 2 node by node (which I guess will be slow). Or I can compute a hash of the data structure to do it faster, right ?

What is the best (efficient and not too long) way to compute this hash ?

I wouldn't like to need too much time to compute hash...

Hope I'm clear.. :-)...

+1  A: 

Every object inherits .equals() and .hashCode() from Object.

The standard data-structures in Java should already implement a relatively fast .hashCode() method for you (The hash might be incrementally calculated or might require iterating over each element, check the source of the data-structure you're using to be sure).

You should be aware that hash collisions might occur even if the data-structures aren't identical.

To have an accurate comparison I would perform a tree traversal simultaneously on both tree comparing each element. This way the shape of the tree as well as the contained elements would be compared in O(n) time where n is the size of the largest tree.

Ben S
It's not necessarily fast. The collections implementation depends in turn on the implementation of the elements in the collection.
erickson
Yes, but given a set of all possible hash algorithms, I'm sure the Java developers chose a comparatively fast one.
Ben S
Keep in mind that the default .equals() implementation is just checking "==" (are the two objects in fact the same object). Your classes should override equals() to do a meaningful comparison.
Scott Stanchfield
+1  A: 

To compute a hash, you have to traverse both trees entirely. You have to examine the properties of each node and perform the hash computation. For example, if a String is in the node, you have to iterate over its characters and do some math. Then you have to combine the hash of the node with the hash of the others.

So, computing a hash value for two structures is of the same order (probably a little more expensive) as comparing them for equality one time. In fact, because when performing an equality comparison, you can stop as soon as you detect any difference, a single equality test is going to be much faster, on average.

Hashing is likely to be beneficial only if you cache the hash value and reuse it many times. And remember, because hash values for different trees can collide, you still need to have equality comparison implemented.

erickson
I plan to cache some hash numbers... If two hash numbers are identical, i still have to traverse the trees and compare them node by node ?
LB
Yes, you still have to compare the trees for equality. There is such a thing as a "perfect" hash function that won't generate any collisions over a specified range, but I don't think you could devise one for a tree structure. Two different trees could have the same hash value. If you have a large collection of trees and you are looking for a match, a good hash code will narrow down the candidates. Then you do a node-by-node comparison to see if any of those on your short-list is really a match.
erickson
+2  A: 

Have you considered keeping a running hashCode which is continually updated as elements are inserted or removed from your trees? This way, comparing a tree at any given time by hashCode will be instantaneous.

Depending on how you implement your hash function, and how frequently you insert and remove nodes, this could be a horrible solution. If your hash function is fast, you aren't making many changes, and you need to do a lot of comparisons, this could work.

gdm
"this could be a horrible solution." I agree.
Ben S
+1  A: 

As gdm says, you could keep a running hashCode, which will allow you to quickly determine whether two trees are different (you'd then need to do a deep comparison once you've determined that they have same hash). You could use the xor (for example) of node.hashCode for all nodes, which makes adding and removing a very simple calculation:

this.hashcode ^= nodeInQuestion.hashCode;

Alternatively, you could create an immutable structure that you can intern. Again, it adds overhead to changes, but no comparison is faster than a reference equality. It depends whether you're optimising for modification or comparison, whether you need a similar speed for positives and negatives, and most importantly whether the size of your trees is actually noticeable.

Mark
+1  A: 

Depending on how expensive it is to compare the nodes, it might be worth to first only compare the topology of the tree and only if the tree structures are identical compare each pair of nodes.

jdisk
A: 

If all objects in the graph implement comparable - you can just call compareTo. Where possible I always implement comparable (as well as hashcode and equals) on POJOS.

To speed this up you can implement shortcuts so that objects that don't match return as early as possible. We do this and it really helps.

I wouldn't try and prematurely optimise other methods this until you have run a real profiler over it (Netbeans is free and has a very nice profiler).

The nice thing about adding compareTo is that it gives you a general purpose feature that is useful elsewhere: TreeMaps, sorted collections etc

Fortyrunner