I've just come across a scenario in my project where I need to compare different tree objects for equality with already known instances, and have considered that some sort of hashing algorithm that operates on an arbitrary tree would be very useful.

Take for example the following tree:

        O
       / \
      /   \
     O     O
    /|\    |
   / | \   |
  O  O  O  O
          / \
         /   \
        O     O

Where each O represents a node of the tree, which is an arbitrary object with an associated hash function. So the problem reduces to: given the hash codes of the nodes and a known tree structure, what is a decent algorithm for computing a (relatively) collision-free hash code for the entire tree?

A few notes on the properties of the hash function:

  • The hash function should depend on the hash code of every node within the tree as well as its position.
  • Reordering the children of a node should distinctly change the resulting hash code.
  • Reflecting any part of the tree should distinctly change the resulting hash code

If it helps, I'm using C# 4.0 here in my project, though I'm primarily looking for a theoretical solution, so pseudo-code, a description, or code in another imperative language would be fine.


UPDATE

Well, here's my own proposed solution. It owes a lot to several of the answers here.

Each node (sub-tree/leaf node) has the following hash function:

public override int GetHashCode()
{
    int hashCode = unchecked((this.Symbol.GetHashCode() * 31 +
        this.Value.GetHashCode()));
    for (int i = 0; i < this.Children.Count; i++)
        hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());
    return hashCode;
}

The nice thing about this method, as I see it, is that hash codes can be cached and only recalculated when the node or one of its descendants changes. (Thanks to vatine and Jason Orendorff for pointing this out).
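A rough sketch of what that caching might look like (member names such as Parent, cachedHashCode and InvalidateHashCode are illustrative assumptions, not part of my actual class):

// Sketch only: assumes the node exposes Symbol, Value, Children and Parent.
private int? cachedHashCode;

public override int GetHashCode()
{
    if (cachedHashCode.HasValue)
        return cachedHashCode.Value;

    int hashCode = unchecked(this.Symbol.GetHashCode() * 31 + this.Value.GetHashCode());
    for (int i = 0; i < this.Children.Count; i++)
        hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());

    cachedHashCode = hashCode;
    return hashCode;
}

// Called whenever this node's data or children change; clears the cached value
// here and in every ancestor so hashes are recomputed lazily on demand.
private void InvalidateHashCode()
{
    cachedHashCode = null;
    if (this.Parent != null)
        this.Parent.InvalidateHashCode();
}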

Anyway, I would be grateful if people could comment on my suggested solution here - if it does the job well, then great, otherwise any possible improvements would be welcome.

+6  A: 

Okay, after your edit, where you've introduced the requirement that the hashing result should be different for different tree layouts, you're left with only the option of traversing the whole tree and writing its structure to a single array.

That's done like this: you traverse the tree and dump the operations you do. For the original tree that could be (for a left-child-right-sibling structure):

[1, child, 2, child, 3, sibling, 4, sibling, 5, parent, parent, //we're at root again
 sibling, 6, child, 7, child, 8, sibling, 9, parent, parent]

You may then hash the list (that is, effectively, a string) any way you like. As another option, you may even return this list as the result of the hash function, so it becomes a collision-free representation of the tree.
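As an illustrative sketch (in C#, since that's what the question uses), dumping the structure into a token list and then hashing the list could look something like this; the Node type with a Children list and a per-node GetHashCode() is an assumption:

// Markers recording where we descend and come back up, so two trees with the
// same pre-order node sequence but different shapes produce different lists.
// (A real implementation would want tokens that can't collide with node hashes.)
const int Down = -1;
const int Up = -2;

static void Emit(Node node, List<int> tokens)
{
    tokens.Add(node.GetHashCode());
    if (node.Children.Count == 0) return;

    tokens.Add(Down);
    foreach (Node child in node.Children)
        Emit(child, tokens);
    tokens.Add(Up);
}

static int HashTree(Node root)
{
    var tokens = new List<int>();
    Emit(root, tokens);

    // Hash the flat token list; any decent sequence hash will do here.
    int h = 17;
    foreach (int t in tokens)
        h = unchecked(h * 31 + t);
    return h;
}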

But encoding precise information about the whole structure is not what hash functions usually do. The approach proposed requires computing the hash of every node as well as traversing the whole tree. So you may consider the other ways of hashing described below.


If you don't want to traverse the whole tree:

One algorithm that immediately came to my mind is this. Pick a large prime number H (greater than the maximal number of children). To hash a tree, hash its root, pick the child number H mod n, where n is the number of children of the root, and recursively hash the subtree rooted at that child.

This seems to be a bad option if trees differ only deeply near the leaves. But at least it should run fast for not very tall trees.
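For illustration, a sketch of this single-path variant in C# (the Node type with Children and GetHashCode() is assumed; 1000003 is just an arbitrary large prime):

static int HashPath(Node node)
{
    const int H = 1000003;                            // prime, larger than any expected child count
    int h = node.GetHashCode();
    if (node.Children.Count > 0)
    {
        int pick = H % node.Children.Count;           // deterministic choice of one child
        h = unchecked(h * 31 + HashPath(node.Children[pick]));
    }
    return h;
}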

If you want to hash less elements but go through the whole tree:

Instead of hashing one subtree, you may want to hash layer-wise: i.e. hash the root first, then hash one of the nodes in its children's layer, then one of the nodes in the grandchildren's layer, etc. So you cover the whole tree instead of one specific path. This makes the hashing procedure slower, of course.

    --- O  ------- layer 0, n=1
       / \
      /   \
 --- O --- O ----- layer 1, n=2
    /|\    |
   / | \   |
  /  |  \  |
 O - O - O O------ layer 2, n=4
          / \
         /   \
 ------ O --- O -- layer 3, n=2

A node from a layer is picked with H mod n rule.

The difference between this version and the previous one is that a tree would have to undergo quite an unnatural transformation to retain the same hash value.
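A sketch of the layer-wise variant, again assuming a Node type with a Children collection:

static int HashLayers(Node root)
{
    const int H = 1000003;
    int h = 17;
    var layer = new List<Node> { root };
    while (layer.Count > 0)
    {
        // Pick one representative node per layer with the H mod n rule.
        Node picked = layer[H % layer.Count];
        h = unchecked(h * 31 + picked.GetHashCode());

        var next = new List<Node>();
        foreach (Node node in layer)
            next.AddRange(node.Children);
        layer = next;
    }
    return h;
}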

Pavel Shved
Interesting suggestion. Not sure how much of a problem it might turn out to be, but these trees could potentially be quite deep and don't really have a maximal number of nodes (could be 10, 100, 1000, or even a bit larger).
Noldorin
hmm, why is everyone's answer so complex? You can generate an exact hashcode in 5 lines, and sample it to generate a shorter hashcode (see my answer below).
Larry Watanabe
@Larry, more tricky solutions are usually more complex than straightforward ones.
Pavel Shved
-1. The proposed approach (walking the tree to build a list) is too complicated. The problem has a simple, direct solution, given by vatine and pnm.
Jason Orendorff
@Jason Orendorff, (a) I proposed three approaches, (b) I don't think it's complicated. Whether this is *my* problem, I'm not sure.
Pavel Shved
+6  A: 

The usual technique of hashing any sequence is combining the values (or hashes thereof) of its elements in some mathematical way. I don't think a tree would be any different in this respect.

For example, here is the hash function for tuples in Python (taken from Objects/tupleobject.c in the source of Python 2.6):

static long
tuplehash(PyTupleObject *v)
{
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;
    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
        x = -2;
    return x;
}

It's a relatively complex combination with constants experimentally chosen for best results on tuples of typical lengths. What I'm trying to show with this code snippet is that the issue is very complex and very heuristic, and the quality of the results probably depends on the more specific aspects of your data - i.e. domain knowledge may help you reach better results. However, for good-enough results you shouldn't look too far. I would guess that taking this algorithm and combining all the nodes of the tree instead of all the tuple elements, plus bringing their position into play, will give you a pretty good algorithm.

One option for taking the position into account is to use the node's position in an in-order walk of the tree.
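For instance, a hedged C# adaptation along those lines, folding each node's hash together with its index in some fixed traversal (the constants are borrowed from the CPython snippet above; the Node type and the traversal are assumptions):

static int HashTree(IEnumerable<Node> nodesInTraversalOrder)
{
    unchecked
    {
        int x = 0x345678;
        int mult = 1000003;
        int position = 0;
        foreach (Node node in nodesInTraversalOrder)
        {
            int y = node.GetHashCode() * 31 + position;   // bring the position into play
            x = (x ^ y) * mult;
            mult += 82520 + 2 * position;
            position++;
        }
        return x + 97531;
    }
}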

Eli Bendersky
You make a good point in general. However, trees are slightly different to sequences in that they contain greater structure - a tree cannot simply be represented by a sequence (unless you specifically know it is a binary/ternary/etc. tree). Might be able to adapt the algorithm easily enough, however...
Noldorin
A tree *can* be represented as a sequence. I showed an example of how in my answer.
Pavel Shved
The "position" incorporates the path information. For example, for every node assign a position value of 0 for the node itself and 1..n for each of its n children from left to right. When you visit child number i in your traversal, you include i in the hash. When you visit the node itself, include 0 and the node's hash contents. The choice of the constants 0, 1, ..., n was arbitrary and should instead be chosen based on domain-specific knowledge, e.g. maybe "0-mississippi", "1-mississippi", etc. will work better.
GregS
@Pavel Shved: It can indeed, yet a sequence is still an ambiguous representation. See this for example: http://pastebin.com/m44d5b6b6 (the same applies to a depth-first traversal)
Noldorin
@Noldorin, that's why it's important to add additional symbols to the sequence, so it will be of greater length than the original tree.
Pavel Shved
+1  A: 

I think you could do this recursively: Assume you have a hash function h that hashes strings of arbitrary length (e.g. SHA-1). Now, the hash of a tree is the hash of a string that is created as a concatenation of the hash of the current element (you have your own function for that) and hashes of all the children of that node (from recursive calls of the function).

For a binary tree you would have:

Hash( h(node->data) || Hash(node->left) || Hash(node->right) )

You may need to carefully check if tree geometry is properly accounted for. I think that with some effort you could derive a method for which finding collisions for such trees could be as hard as finding collisions in the underlying hash function.
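For a binary tree, a minimal C# sketch of this scheme might look as follows (the Node type with Left, Right and a Data string is an assumption, and SHA-1 merely stands in for h):

using System.Security.Cryptography;
using System.Text;

static byte[] HashTree(Node node)
{
    if (node == null)
        return new byte[0];                 // empty placeholder for a missing child

    using (var sha1 = SHA1.Create())
    {
        byte[] dataHash = sha1.ComputeHash(Encoding.UTF8.GetBytes(node.Data));
        byte[] left = HashTree(node.Left);
        byte[] right = HashTree(node.Right);

        // Hash( h(node.Data) || Hash(node.Left) || Hash(node.Right) )
        var buffer = new byte[dataHash.Length + left.Length + right.Length];
        dataHash.CopyTo(buffer, 0);
        left.CopyTo(buffer, dataHash.Length);
        right.CopyTo(buffer, dataHash.Length + left.Length);
        return sha1.ComputeHash(buffer);
    }
}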

Krystian
+2  A: 

I can see that if you have a large set of trees to compare, then you could use a hash function to retrieve a set of potential candidates, then do a direct comparison.

A string representation that would work is to just use Lisp syntax to put brackets around the tree, writing out the identifier of each node in pre-order. But this is computationally equivalent to a pre-order comparison of the tree, so why not just do that?

I've given 2 solutions: one is for comparing the two trees when you're done (needed to resolve collisions) and the other to compute the hashcode.

TREE COMPARISON:

The most efficient way to compare will be to simply recursively traverse each tree in a fixed order (pre-order is simple and as good as anything else), comparing the node at each step.

  1. So, just create a Visitor pattern that successively returns the next node in pre-order for a tree, i.e. its constructor can take the root of the tree.

  2. Then, just create two instances of the Visitor that act as generators for the next node in pre-order, i.e. Visitor v1 = new Visitor(root1), Visitor v2 = new Visitor(root2)

  3. Write a comparison function that can compare one node to another.

  4. Then just visit each node of the trees, comparing, and returning false if comparison fails. i.e.

Module

 Function Compare(Node root1, Node root2)
      Visitor v1 = new Visitor(root1)
      Visitor v2 = new Visitor(root2)

      loop
          Node n1 = v1.next
          Node n2 = v2.next
          if (n1 == null) and (n2 == null) then
                return true
          if (n1 == null) or (n2 == null) then
                return false
          if n1.compare(n2) != 0 then
                return false
      end loop
      // unreachable
 End Function

End Module

HASH CODE GENERATION:

If you want to write out a string representation of the tree, you can use the Lisp syntax for a tree, then sample the string to generate a shorter hashcode.

Module

 Function TreeToString(Node n1) : String
        if n1 == null
            return ""
        String s1 = "(" + n1.toString()
        for each child of n1
            s1 = s1 + TreeToString(child)

        return s1 + ")"
 End Function

End Module

The node.toString() can return the unique label/hash code/whatever for that node. Then you can just do a string comparison on the strings returned by the TreeToString function to determine whether the trees are equivalent. For a shorter hashcode, just sample the output of TreeToString, i.e. take every 5th character.
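In C#, that idea might look roughly like this (the Node type with a Children collection is assumed, and the sampling stride of 5 is arbitrary):

using System.Text;

// Parenthesised pre-order representation, Lisp-style.
static string TreeToString(Node node)
{
    if (node == null)
        return "";

    var sb = new StringBuilder();
    sb.Append("(").Append(node.ToString());
    foreach (Node child in node.Children)
        sb.Append(TreeToString(child));
    return sb.Append(")").ToString();
}

// Sample the string (every 5th character here) to get a shorter, lossy hash code.
static int SampledHash(Node root)
{
    string s = TreeToString(root);
    int h = 17;
    for (int i = 0; i < s.Length; i += 5)
        h = unchecked(h * 31 + s[i]);
    return h;
}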

Larry Watanabe
In my situation, I'll often be comparing various trees against the *same* other tree many times. In this case, surely it would be more efficient to compute the hash, since you'll only need to recurse over the common tree's nodes once?
Noldorin
I got that - that's why i modified my answer to include the hash code generator. You can simply write the tree out as a string, then sample it. If you keep every character, you are guaranteed no collisions, but less efficient. every other character, more efficient, possible collisions. You choose the tradeoff based on your application, data, etc.
Larry Watanabe
@Larry: Ah, that's a novel approach. +1 It's certainly simple and effective I would think... though perhaps not the most efficient. Will have a think about it.
Noldorin
Well, you can always optimize it easily enough by skipping characters instead of actually generating them. As long as the skip function is consistent, it would still return a valid hash code. Get it right first, optimize later :) This simple approach is easy to optimize and extend. If you look at the other approaches you will see that they are just using a lazy generation method for the hash code, but it is a little more ad hoc than this approach and much less simple. Start with simple and correct, then optimize later.
Larry Watanabe
A: 
// Requires: using System.Collections.Generic; using System.Linq;
// AddToMix and GetPositionInParent are left undefined here (discussed below).
class TreeNode
{
  public static int QualityAgainstPerformance = 3; // tune this for your needs
  public static int PositionMarkConstant = 23498735; // just anything
  public object TargetObject; // the subject of this TreeNode, which contributes its own hashcode
  public TreeNode Parent;
  public List<TreeNode> Children = new List<TreeNode>();

  // Pre-order enumeration of this node and its descendants.
  IEnumerable<TreeNode> GetChildParticipants()
  {
   yield return this;

   foreach (var child in Children)
    foreach (var descendant in child.GetChildParticipants())
     yield return descendant;
  }

  // Enumeration of the ancestors, nearest first.
  IEnumerable<TreeNode> GetParentParticipants()
  {
   TreeNode parent = Parent;
   while (parent != null)
   {
    yield return parent;
    parent = parent.Parent;
   }
  }

  public override int GetHashCode()
  {
   int computed = 0;
   var nodesToCombine =
    (Parent != null ? Parent : this).GetChildParticipants()
     .Take(QualityAgainstPerformance / 2)
     .Concat(GetParentParticipants().Take(QualityAgainstPerformance / 2));

   foreach (var node in nodesToCombine)
   {
    if (ReferenceEquals(node, this))
      computed = AddToMix(computed, PositionMarkConstant);
    computed = AddToMix(computed, node.GetPositionInParent());
    computed = AddToMix(computed, node.TargetObject.GetHashCode());
   }
   return computed;
  }
}

AddToMix is a function which combines the two hashcodes, so the sequence matters. I don't know exactly what it should be, but you can figure it out. Some bit shifting, rounding, you know...
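For what it's worth, one plausible stand-in for AddToMix (not the author's definition, just an ordinary order-sensitive combiner) could be:

static int AddToMix(int accumulated, int next)
{
    unchecked
    {
        // Order-sensitive combine: multiply-add, then some extra bit mixing.
        int h = accumulated * 31 + next;
        h ^= h << 13;
        h ^= h >> 7;
        return h;
    }
}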

The idea is that you have to analyse some environment of the node, depending on the quality you want to achieve.

George Polevoy
@George: Mind commenting/explaining that code a bit more please?
Noldorin
A: 

I have to say that your requirements are somewhat against the entire concept of hash codes.

Hash function computational complexity should be very limited.

Its computational complexity should not depend linearly on the size of the container (the tree), otherwise it totally breaks the hashcode-based algorithms.

Considering the position as a major property of the node's hash function also somewhat goes against the concept of a tree, but it is achievable if you relax the requirement that it HAS to depend on the position.

The overall principle I would suggest is replacing MUST requirements with SHOULD requirements. That way you can come up with an appropriate and efficient algorithm.

For example, consider building a limited sequence of integer hashcode tokens, and add what you want to this sequence, in the order of preference.

Order of the elements in this sequence is important, it affects the computed value.

For example, for each node you would compute:

  1. add the hashcode of the underlying object
  2. add the hashcodes of the underlying objects of the nearest siblings, if available. I think even the single left sibling would be enough.
  3. add the hashcode of the underlying object of the parent and its nearest siblings, like for the node itself, same as 2.
  4. repeat this with the grandparents to a limited depth.

    //--------5------- ancestor depth 2 and it's left sibling;
    //-------/|------- ;
    //------4-3------- ancestor depth 1 and it's left sibling;    
    //-------/|------- ;
    //------2-1------- this;
    

    the fact that you are adding a direct sibling's underlying object's hashcode gives a positional property to the hash function.

    if this is not enough, add the children: you shouldn't add every child, just some, to give a decent hashcode.

  5. add the first child and its first child and its first child... Limit the depth to some constant, and do not compute anything recursively - just the underlying node's object's hashcode.

    //----- this;
    //-----/--;
    //----6---;
    //---/--;
    //--7---;
    

This way the complexity is linear in the depth of the underlying tree, not in the total number of elements.

Now you have a sequence of integers; combine them with a known algorithm, like Eli suggests above.

1,2,...7

This way, you will have a lightweight hash function with a positional property, not dependent on the total size of the tree, not even dependent on the tree depth, and not requiring the hash of the entire tree to be recomputed when you change the tree structure.

I bet these 7 numbers would give a hash distribution close to perfect.
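If it helps, a rough sketch of how such a bounded-neighbourhood hash could be assembled (the TreeNode members used here - Parent, Children, TargetObject, GetPositionInParent - are assumed from the earlier answer; the depth limits are arbitrary):

static int NeighbourhoodHash(TreeNode node, int ancestorDepth = 2, int childDepth = 2)
{
    var tokens = new List<int> { node.TargetObject.GetHashCode() };

    // Walk up: each ancestor plus, if present, its immediate left sibling.
    TreeNode current = node;
    for (int d = 0; d < ancestorDepth && current.Parent != null; d++)
    {
        int pos = current.GetPositionInParent();
        if (pos > 0)
            tokens.Add(current.Parent.Children[pos - 1].TargetObject.GetHashCode());
        current = current.Parent;
        tokens.Add(current.TargetObject.GetHashCode());
    }

    // Walk down the first-child chain to a limited depth.
    current = node;
    for (int d = 0; d < childDepth && current.Children.Count > 0; d++)
    {
        current = current.Children[0];
        tokens.Add(current.TargetObject.GetHashCode());
    }

    // Combine the short, fixed-length sequence in order.
    int h = 17;
    foreach (int t in tokens)
        h = unchecked(h * 31 + t);
    return h;
}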

George Polevoy
-1. All hash code algorithms I know of use all the data. Hash codes are simply cached to make calculating them constant-time in practice.
Jason Orendorff
A: 

A simple enumeration (in any deterministic order) together with a hash function that depends on when the node is visited should work.

int hash(Node root) {
  ArrayList<Node> worklist = new ArrayList<Node>();
  worklist.add(root);
  int h = 0;
  int n = 0;
  while (!worklist.isEmpty()) {
    Node x = worklist.remove(worklist.size() - 1);
    worklist.addAll(x.children());
    h ^= place_hash(x.hash(), n);
    n++;
  }
  return h;
}

int place_hash(int hash, int place) {
  return (Integer.toString(hash) + "_" + Integer.toString(place)).hashCode();
}
Keith Randall
I don't think this satisfies the requirement of distinguishing trees that have the same preorder traversal but different structures.
Jason Orendorff
I don't think that was listed as a requirement. If you want that, you could add node depth to the hash. I suspect preorder index plus node depth would determine a unique tree.
Keith Randall
@Keith: Jason is right - order of traversal is not enough - structure needs to be accounted for too.
Noldorin
+4  A: 

Any time you are working with trees recursion should come to mind:

public override int GetHashCode() {
    int hash = 5381;
    foreach(var node in this.BreadthFirstTraversal()) {
        hash = 33 * hash + node.GetHashCode();
    }
    return hash;
}

The hash function should depend on the hash code of every node within the tree as well as its position.

Check. We are explicitly using node.GetHashCode() in the computation of the tree's hash code. Further, because of the nature of the algorithm, a node's position plays a role in the tree's ultimate hash code.

Reordering the children of a node should distinctly change the resulting hash code.

Check. They will be visited in a different order during the traversal, leading to a different hash code. (Note that if there are two children with the same hash code, you will end up with the same hash code upon swapping the order of those children.)

Reflecting any part of the tree should distinctly change the resulting hash code

Check. Again the nodes would be visited in a different order leading to a different hash code. (Note that there are circumstances where the reflection could lead to the same hash code if every node is reflected into a node with the same hash code.)

Jason
@Jason: Thanks for the reply. This is indeed a nice simple solution - it's what first came to my mind, yet it fails the condition I propose here: http://pastebin.com/m44d5b6b6 (Sorry for not stating this in my original question.)
Noldorin
This has a bug. Instead of `this.InOrderTraversal()` it should say `this.ChildNodes`. Otherwise each node will be visited 2^(n-1) times where n is the number of ancestors it has...
Jason Orendorff
@Noldorin: You're right, and a depth-first search would have a similar problem. Consequently I think you're going to need to encode the path that you take into the process.
Jason
@Jason Orendorff: Huh? In a tree traversal every node is visited once.
Jason
@Jason: Yeah, this is reasonably similar to my own suggested solution. I'm wondering, is there any real advantage of initialising `hash` to a non-zero (prime) value?
Noldorin
+13  A: 

If I were to do this, I'd probably do something like the following:

For each leaf node, compute the concatenation of 0 and the hash of the node data.

For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.

This will lead to a cascade up the tree every time you change anything, but that MAY be a low enough overhead to be worthwhile. If changes are relatively infrequent compared to the number of hash lookups, it may even make sense to go for a cryptographically secure hash.

Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagating a "false" up the tree (or a "hash invalid" flag and propagating "true") on a node change. That way, it may be possible to avoid a complete recalculation when the tree hash is needed, and possibly to avoid multiple hash calculations that are not used, at the risk of slightly less predictable time to get a hash when needed.

Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions, if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node, with "symbol hash" 30 and "value hash" 25 and a two-node tree, where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25. The examples are entirely invented, I don't know what expected hash ranges are so I can only comment on what I see in the presented code.

Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.

However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).

Edit2: As for specific algorithms and minimum data structure needed, something like the following (Python, translating to any other language should be relatively easy).

#! /usr/bin/env python

import Crypto.Hash.SHA

class Node:
    def __init__(self, parent=None, contents="", children=None):
        self.parent = parent
        self.valid = False
        self.hash = None
        self.contents = contents
        self.children = children if children is not None else []

    def append_child(self, child):
        self.children.append(child)
        self.invalidate()

    def invalidate(self):
        self.valid = False
        if self.parent:
            self.parent.invalidate()

    def gethash(self):
        if self.valid:
            return self.hash

        digester = Crypto.Hash.SHA.new()

        digester.update(self.contents)

        if self.children:
            for child in self.children:
                digester.update(child.gethash())
            self.hash = "1" + digester.hexdigest()
        else:
            self.hash = "0" + digester.hexdigest()

        self.valid = True
        return self.hash

    def setcontents(self, contents):
        self.contents = contents
        self.invalidate()
Vatine
+1. This is the right answer. Calculating this can be O(1) amortized, because you can just cache at each node the hash of the subtree from there. (When changes are made, you can walk up the tree, just marking each cached hash code as invalid rather than recalculating them. This way when many changes are made one after another, you don't have to walk up the tree every time.)
Jason Orendorff
Good suggestion; at the least, for the proposal of cascading changes up the tree and caching hash codes.
Noldorin
+1, but I have a modification to suggest, assuming that node hashcode calculation has a measurable cost: instead of recalculating the cached hashcodes on change, invalidate them. You have to walk up the tree either way, but there's no need to recalculate your hashcodes until they're called for, so if you get multiple updates between comparisons, you only pay the recalculation cost once per comparison.
CPerkins
+1 for correct answer. Poster asked for the best theoretical solution, and this is exactly how they do it in cryptography papers. If security is not an issue, then you can probably just get away with concatenating all the values (and their numeric positions) and hashing that, for a very-quick, *usually-collision-free-assuming-no-malicious-users* hash.
BlueRaja - Danny Pflughoeft
Yeah, I'm definitely leaning towards accepting this answer. If you could make any comments/suggestions on the specific algorithm of my own proposed solution, would appreciate that and would definitely give you the answer.
Noldorin
@vatine: Thanks for the extra information in your edits. This seem like a pretty complete answer now, so the bounty is yours. :)
Noldorin
+4  A: 

The collision-free property of this will depend on how collision-free the hash function used for the node data is.

It sounds like you want a system where the hash of a particular node is a combination of the child node hashes, where order matters.

If you're planning on manipulating this tree a lot, you may want to pay the price in space of storing the hashcode with each node, to avoid the penalty of recalculation when performing operations on the tree.

Since the order of the child nodes matters, a method which might work here would be to combine the node data and children using prime number multiples and addition modulo some large number.

To go for something similar to Java's String hashcode:

Say you have n child nodes.

hash(node) = hash(nodedata) +
             hash(childnode[0]) * 31^(n-1) +
             hash(childnode[1]) * 31^(n-2) +
             <...> +
             hash(childnode[n-1])

Some more detail on the scheme used above can be found here: http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
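A literal C# rendering of that formula, using Horner's rule for the powers of 31 (the Node type with Data and Children is an assumption):

static int Hash(Node node)
{
    unchecked
    {
        int children = 0;
        foreach (Node child in node.Children)
            children = children * 31 + Hash(child);   // Horner form of the 31^k weights
        return node.Data.GetHashCode() + children;
    }
}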

A: 

Writing your own hash function is almost always a bug, because you basically need a degree in mathematics to do it well. Hash functions are incredibly nonintuitive and have highly unpredictable collision characteristics.

Don't try directly combining hashcodes for child nodes - this will magnify any problems in the underlying hash functions. Instead, concatenate the raw bytes from each node in order, and feed this as a byte stream to a tried-and-true hash function. All the cryptographic hash functions can accept a byte stream. If the tree is small, you may want to just create a byte array and hash it in one operation.
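A sketch of that approach, assuming a Node type with Data and Children, and using SHA-256 as one example of a proven hash:

using System.IO;
using System.Security.Cryptography;
using System.Text;

static byte[] HashTree(Node root)
{
    using (var stream = new MemoryStream())
    {
        WriteNode(root, stream);
        stream.Position = 0;
        using (var sha = SHA256.Create())
            return sha.ComputeHash(stream);
    }
}

static void WriteNode(Node node, Stream stream)
{
    // Brackets keep the structure unambiguous in the byte stream.
    byte[] bytes = Encoding.UTF8.GetBytes("(" + node.Data);
    stream.Write(bytes, 0, bytes.Length);
    foreach (Node child in node.Children)
        WriteNode(child, stream);
    stream.WriteByte((byte)')');
}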

BobMcGee