views:

416

answers:

6

I've written an n-ary tree ADT which works fine. However, I need to store its serialization in a variable a calling class. eg.

 DomTree<String> a = Data.createTreeInstance("very_large_file.xml");
 String x = a.toString();

I've written method which serves the purpose exactly how I need it, but on very large inputs it takes forever (20mins on a 100MB xml file) - I have timed the methods and building the tree from the xml file is quick, but calling toString() as shown above is very slow.

@Override
public String toString(){
 return printTree(this);
}

public String printTree(AbstractTree<E> tree){
 if (tree.isLeaf()){
  return tree.getNodeName();
 }else{
  String tStr = tree.getNodeName() + "(";

  int i = 0;
  Iterator<AbstractTree<E>> child = tree.getChildren().iterator();
     while (i < tree.getChildren().size() - 1){

      tStr += printTree(child.next()) + ", ";
      i++;
     }
     tStr += printTree(child.next()) + ")";

  return tStr; 
 }
}

I'm guessing it is to do with the way the string is built up rather than how the tree is traversed? Is there a better way to do this?

UPDATE: Following the example of Skaffman, the following code give outOfMemoryError for very large input.

@Override
public String toString(){
    StringBuilder buffer = new StringBuilder();
    printTree(this, buffer);
    return buffer.toString();

}

public String printTree(AbstractTree<E> tree, StringBuilder buffer){
 if (tree.isLeaf()){
  return tree.getNodeName();
 }else{
  buffer.append(tree.getNodeName());
  buffer.append("(");

  int i = 0;
  Iterator<AbstractTree<E>> child = tree.getChildren().iterator();
     while (i < tree.getChildren().size() - 1){

      buffer.append(printTree(child.next(), buffer));
      buffer.append(", ");
      i++;
     }
     buffer.append(printTree(child.next(), buffer)); 
     buffer.append(")");

  return buffer.toString(); 
 }
}

UPDATE: Works perfectly now, using Skaffmans example

+2  A: 

Look at StringBuilder, don't use simple concatenation, and pass the StringBuilder through your entire process (or make it a global).

Will Hartung
+11  A: 

String concats like that are punishingly slow. Use a StringBuilder.

@Override
public String toString(){
        StringBuilder buffer = new StringBuilder();
        printTree(this, buffer);
        return buffer.toString();
}

public void printTree(AbstractTree<E> tree, StringBuilder buffer){
    if (tree.isLeaf()){
        buffer.append(tree.getNodeName());
    } else {
        buffer.append(tree.getNodeName());
        buffer.append("(");

        int i = 0;
        Iterator<AbstractTree<E>> child = tree.getChildren().iterator();
        while (i < tree.getChildren().size() - 1){
            printTree(child.next(), buffer);
            buffer.append(", ");
            i++;
        }
        printTree(child.next(), buffer); 
        buffer.append(")");
    }
}
skaffman
+1 nice working example
dfa
I've followed your example, but I get an outOfMemoryError. I've set the VM args to -Xms2g -Xmx2g, but this doesn't help...
Robert
what is the purpose of the String returned by the method?
dfa
the purpose of the String is to plug into several distance metric algorithms i'm testing.
Robert
works a charm now
Robert
Dare I say it, but is this one example where stateful, iterative programming is much superior to functional programming?
skaffman
+1  A: 

Don't use string concatenation in loops. It does not scale.

Use StringBuilder, this does not make new objects all the time, like string concatenation..

void print() {
StringBuilder sb = new StringBuilder();
sb.append("hello");
sb.append(" World!");
System.out.println(sb.toString());

}

raoulsson
This is the perfect answer I think. Concatenation is fine outside of loops--in fact the JVM optimizes it so well that it's probably faster than using any of the alternatives, but in a loop, performance just dies. Look at the String source code if you want to see some interesting optimizations.
Bill K
@Bill K: Performance dies so bad in a loop to the point that the total concatenation cost is O(n^2) in the worst case, right? Just as I said in my answer. Can you have a look at my update?
Tom
A: 

You might want to look at String.intern() as a way to cut down on memory use. This will use the interned String from the string pool. If you have many duplicated strings, it might be faster. More info on interned strings here

sal
the problem is not string comparison but string concatenation; imho String.intern() is not effective in this case
dfa
+1  A: 

Let me say the reason that string concatenation is slow is because strings are immutable. This means every time you write "+=", a new String is created. This means the way you build up your string is in the worst case, O(n2). That's because if you +='ed 1 char at a time, the cost of building a new string would be 2 + 3 + 4 + ... + n, which is O(n2).

Use StringBuilder as other's suggest (over the slower, but threadsafe StringBuffer).

I suppose I should add, StringBuilder will give you O(n) amortized time, because it works like a vector behind the scenes, since it is mutable. So build up your string there, and then call toString().

StringBuilder builder = new StringBuilder();
builder.append("blah"); // append more as needed.
String text = builder.toString();

I would also like to add that this problem is similar in Python. The idiom in python is to append all your strings to concatenate into a list, and then join the list. "".join(the_list).

UPDATE: As Bill points out, concatenation is not the root of all evil. One off string concatenations are fine, and may even be optimized! (They are also worst case linear). But, when you are concatenating in a loop, as you are above, the performance will drastically change as the number of iterations goes up. In that case, my above analysis is flawless, as I specifically stated it is "worst case", which means you assume no optimizations. (Which the JVM can't even optimize the concatenation in loops as well as it can outside).

Tom
Correct in theory, in reality you should look at the String class, some concatenations don't actually allocate new strings. The internal array used to store the string can be shared between two strings of different lengths--so it can be expanded and a new string copied behind the existing one and two Strings can have the same backing arrays with different lengths. Problem is, this only works once--after the "Shared" flag is set, you can't really do it again--so in loops you are completely correct.
Bill K
So why is this -1? I also specifically said that this is the worst case performance... which is definitely correct. Worst case would mean optimizations are working against you.
Tom
But it's not, when in a loop. Maybe I should update and clarify.
Tom
maybe string concatenation is not quadratic? maybe it is linear O(n+m) where n = |str1| and m = |str2| ?
dfa
Please see my clarified answer. It certainly is linear for a one off thing. But it is quadratic in the worst case when you do a series of concatenations.
Tom
@dfa: The reason it is quadratic is because of the number of linear concatenations you do. If you do 1 char at a time, you get (1 + 1) + (2 + 1) + (3 + 1) + ... + (n-1 + 1), which is 2 + 3 + ... + n, which is n*(n + 1)/2 - 1, which is O(n^2).
Tom
I'm a bit upset this answer is in the negatives... there really isn't anything wrong with is, and I thought it would be nice to give an explanation as to why concatenations in loops are slow (so the OP could understand), instead of just saying "oh they are slow, use StringBuilder". Especially now that I updated and clarified...
Tom
+1  A: 

If a profiler confirms you that the bottleneck is string concatenation you have two choices:

  • StringBuilder/StringBuffer (the latter is better suited for threading)
  • Ropes for Java:

A rope is a high performance replacement for Strings. The datastructure, described in detail in "Ropes: an Alternative to Strings", provides asymptotically better performance than both String and StringBuffer for common string modifications like prepend, append, delete, and insert. Like Strings, ropes are immutable and therefore well-suited for use in multi-threaded programming.

dfa