views:

92

answers:

4

I'm building a list of hashes that represent root to node paths in a tree. My functions work but they are incredibly slow over large tree structures - is there a better way? I've tried building the list in one function but I get unique hashes where I don't want them.

public ArrayList<Integer> makePathList(AbstractTree<String> tree){
 StringBuilder buffer = new StringBuilder();
 ArrayList<Integer> pl = new ArrayList<Integer>();
 ArrayList<StringBuilder> paths = getPaths(tree, buffer);
 for(StringBuilder sb : paths){
  pl.add(sb.toString().hashCode());
 }

 return pl;
}

public ArrayList<StringBuilder> getPaths(AbstractTree<String> tree, StringBuilder parent){
        ArrayList<StringBuilder> list = new ArrayList<StringBuilder>(); 
        parent.append("/");
        parent.append(tree.getNodeName());
        list.add(new StringBuilder(parent));

        if (!tree.isLeaf()){    
            int i = 0;
            Iterator<AbstractTree<String>> child = tree.getChildren().iterator();
            while (i < tree.getChildren().size()){  
                list.addAll(getPaths(child.next(), new StringBuilder(parent)));
                i++;
            }  
        }
        return list;
}

UPDATE:

Marcin's suggestion to make the hash during tree traversal gives the wrong answer, but perhaps that is the way I have done it?

public ArrayList<Integer> getPaths(AbstractTree<String> tree, StringBuilder parent){
 ArrayList<Integer> list = new ArrayList<Integer>();

 parent.append("/");
 parent.append(tree.getNodeName());
 list.add(new StringBuilder(parent).toString().hashCode());

 if (!tree.isLeaf()){ 
  int i = 0;
  Iterator<AbstractTree<String>> child = tree.getChildren().iterator();
     while (i < tree.getChildren().size()){

      list.addAll(getPaths(child.next(), new StringBuilder(parent)));
      i++;
     }  
 }
 return list;
}
A: 

Where does jvisualvm indicate that the performance bottleneck is?

Thorbjørn Ravn Andersen
I don't know how to use jvisualvm, but I have timed the methods, using a 100MB XML tree. making paths... Done [3614ms]creating hash codes... Done [962ms] Total Done [4576ms]
Robert
It won't identify the core problem in this case, but you really should learn how to use a profiler such as visualvm. It's the only professional way to attack performance problems.
Michael Borgwardt
I will strongly recommend learning how to use a profiler. The lowest hanging fruit there is jvisualvm.
Thorbjørn Ravn Andersen
@Michael, a profiler would indicate that there is no single bottleneck, so it is most likely the algorithm :)
Thorbjørn Ravn Andersen
+1  A: 

I think your main problem is the amount of duplicate data you are producing: for every single leaf of the tree, you will make a copy of the entire path leading up to that leaf and compute the hash for that path. i.e. if you have 50,000 leaves under one top-level node, then the path name of that node will be copied 50,000 times and its hash computed 50,000 times.

If you could organize your data so that shared path prefixes are reused as references between leaves and hash calculations for these prefixes are cached and reused, you could drastically reduce the actual amount of work to be done.

Michael Borgwardt
This sounds like an interesting solution - do you have an example of such a method?
Robert
I don't have the time to supply working code, but basically instead of building the path in StringBuilder instances, represent a path as a list of path elements, each with a name and a partial hash down to that element.
Michael Borgwardt
A: 

You first create a list of all paths and then once you have them all you compute hashes. Size of list of all those paths is O(n^3) (there's O(n^2) paths, each O(n) long) Why? Why not just compute the hashes as you traverse the tree? This way you'll take whole one n out of your time complexity.

The code for proper solution (result ends up in passed in list of integers):

public void getPaths(AbstractTree<String> tree, StringBuilder parentPath, 
    List<Integer> list)
  StringBuilder newPath = parentPath.clone();
  newPath.append("/");
  newPath.append(tree.getNodeName());
  list.add(newPath.toString().hashCode());
  if (!tree.isLeaf()){    
     Iterator<AbstractTree<String>> child = tree.getChildren().iterator();
     for (AbstractTree<String> child : tree.getChildren()){
       getPaths(child, newPath, list)
     }
  }  
}

This still is O(n^2). That is because of hashing of O(n^2) worth of strings (each node has a path length proportional to its depth) and you could bring it down even to O(N) if you had a hash that for given node takes only a hash of the path of its parents and modifies it in some way.

Furhter optimizations include: - parallel tree traversal - using smarter hashing (i.e. hash of a child is a function of child and hash of the parent path, not whole parent path).

Marcin
tried to compute hashes during tree traveral, but it gives the wrong answer - perhaps you can see why? (see original question for code)
Robert
I improved the solution. It should be better now.
Marcin
I'm a little confused by this solution. First, how do you get the result? passing a list as parameter makes a copy of the list and doesn't modify the original list. Secondly, the clone method is not visible to parentPath.
Robert
Robert, passed in list is NOT copied. It is passed in by reference, so you can freely modify it. The SAME problem was in your solution.
Marcin
A: 

I think complexity is still the same. No matter if you use a inline creation of hashes (O(n^2)) or doing it after recursion (O(n^2 + n) = O(n^2)). The only opportunity to find a fast way is to do some of the work at another place. e.g. you can hash the path while inserting a node and only gather all hashes at another point.

Dirk