views:

173

answers:

5

UPDATES: thanks a lot to Gabe and Glenn for the detailed explanation. The test is wrote not for language comparison benchmark, just for my studying on VM optimization technologies.

I did a simple test to understand the performance of string concatenation between Java and Python.

The test is target for the default immutable String object/type in both languages. So I don't use StringBuilder/StringBuffer in Java test.

The test simply adds strings for 100k times. Java consumes ~32 seconds to finish, while Python only uses ~13 seconds for Unicode string and 0.042 seconds for non Unicode string.

I'm a bit surprise about the results. I thought Java should be faster than Python. What optimization technology does Python leverage to achieve better performance? Or String object is designed too heavy in Java?

OS: Ubuntu 10.04 x64 JDK: Sun 1.6.0_21 Python: 2.6.5

Java test did use -Xms1024m to minimize GC activities.

Java code:

public class StringConcateTest {
public static void test(int n) {
    long start = System.currentTimeMillis();
    String a = "";
    for (int i = 0; i < n; i++) {
        a = a.concat(String.valueOf(i));
    }
    long end = System.currentTimeMillis();
    System.out.println(a.length() + ", time:" + (end - start));
}

public static void main(String[] args) {
    for (int i = 0; i < 10; i++) {
        test(1000 * 100);           
    }
}

}

Python code:

import time
def f(n):
    start = time.time()
    a = u'' #remove u to use non Unicode string
    for i in xrange(n):
        a = a + str(i)
    print len(a), 'time', (time.time() - start)*1000.0
for j in xrange(10):
    f(1000 * 100)
+1  A: 

My guess is that Python just does a realloc on the string rather than create a new one with a copy of the old one. Since realloc takes no time when there is enough empty space following the allocation, it is very fast.

So how come Python can call realloc and Java can't? Python's garbage collector uses reference counting so it can tell that nobody else is using the string and it won't matter if the string changes. Java's garbage collector doesn't maintain reference counts so it can't tell whether any other reference to the string is extant, meaning it has no choice but to create a whole new copy of the string on every concatenation.

EDIT: Although I don't know that Python actually does call realloc on a concat, here's the comment for _PyString_Resize in stringobject.c indicating why it can:

       The following function breaks the notion that strings are immutable:
       it changes the size of a string.  We get away with this only if there
       is only one module referencing the object.  You can also think of it
       as creating a new string object and destroying the old one, only
       more efficiently.  In any case, don't use this if the string may
       already be known to some other part of the code...
Gabe
Great explanation; I don't know enough about Python to know if it's true or not. But even if Java did know that "a" isn't being referenced by something else, it would have the JLS (http://java.sun.com/docs/books/jls/third_edition/html/expressions.html#39990) to contend with: *The result is a reference to a String object (**newly created**, unless the expression is a compile-time constant expression)...* So implementations couldn't optimize this even if they wanted to, it seems.
Mark Peters
Python strings are immutable just like Java's, so I doubt python is doing a realloc().
GregS
GregS: As I noted in the second paragraph, Python has a reference count so it can do a `realloc` because it can know that nobody else is referencing the string. See my edit for why it's possible.
Gabe
Gabe: Thanks, I think I understand now. Make sense.
GregS
@GregS is correct, no realloc is used. A new string is `malloc`'ed and the two strings are `memcpy`'ed into the new string. (at-a-glance of 2.4 source.)
zdav
@zdav is mistaken, possibly because he's looking at ancient source code; see my answer for details.
Glenn Maynard
+1  A: 

I don't think your test means a lot, since Java and Python handle strings differently (I am no expert in Python but I do know my way in Java). StringBuilders/Buffers exists for a reason in Java. The language designers didn't do any kind of more efficient memory management/manipulation exactly for this reason: there are other tools than the "String" object to do this kind of manipulation and they expect you to use them when you code.

When you do things the way they are meant to be done in Java, you will be surprised how fast the platform is... But I have to admit that I have been pretty much impressed by the performance of some Python applications I have tried recently.

Eric-Karl
I agree with you partly, but for a benchmark you trying to compare likes with likes. The fact that they handle strings differently is sort of the point of the benchmark. Why is python's seemingly significantly faster? If the OPs benchmark has a defect it is surely not that the two languages do things differently.
GregS
@GregS: I agree a benchmark should compare likes with likes. Yet, I believe that those "likes" should be the best way to do things on said platform. And, in java, intensively manipulating strings should be done with the StringBuilders.Why is python's faster? I don't know but I admit it's impressive. I just think the benchmark here is a bit too far from real world application...
Eric-Karl
A: 

I do not know the answer for sure. But here are some thoughts. First, Java internally stores strings as char [] arrays containing the UTF-16 encoding of the string. This means that every character in the strings takes at least two bytes. So just in terms of raw storage, Java would have to copy around twice as much data as python strings. Python unicode strings are therefore the better test because they are similarly capable. Perhaps python stores unicode strings as UTF-8 encoded bytes. In that case, if all you are storing in these are ASCII characters, then again you'd have Java using twice as much space and therefore doing twice as much copying. To get a better comparison you should concatenate strings containing more interesting characters that require two or more bytes in their UTF-8 encoding.

GregS
Python does no such optimization. From Python's unicodeobject.h: "Setting Py_UNICODE_WIDE enables UCS-4 storage. Otherwise, Unicode strings are stored as UCS-2 (with limited support for UTF-16)"
Gabe
@Gabe: Thanks, that is good information. So perhaps any performance improvement is due entirely to considerations you outlined in your answer.
GregS
A: 

I ran Java code with a StringBuilder in place of a String and saw an average finish time of 10ms (high 34ms, low 5ms).

As for the Python code, using "Method 6" here (found to be the fastest method), I was able to achieve an average of 84ms (high 91ms, low 81ms) using unicode strings. Using non-unicode strings reduced these numbers by ~25ms.

As such, it can be said based on these highly unscientific tests that using the fastest available method for string concatenation, Java is roughly an order of magnitude faster than Python.

But I still <3 Python ;)

Matt
His results are wrong, or at least out of date: method 1 is the fastest in 2.6 by a small margin. (It's not surprising if you havn't run it, since it doesn't even run as-is, as it uses a module named `timing` which he doesn't include, rather than using the standard module `timeit`.) Note also that the `ps_stats` call takes a long time (about 18ms here), and must be removed to have meaningful results.
Glenn Maynard
+1  A: 

@Gabe's answer is correct, but needs to be shown clearly rather than hypothesized.

CPython (and probably only CPython) does an in-place string append when it can. There are limitations on when it can do this.

First, it can't do it for interned strings. That's why you'll never see this if you test with a = "testing"; a = a + "testing", because assigning a string literal results in an interned string. You have to create the string dynamically, as this code does with str(12345). (This isn't much of a limitation; once you do an append this way once, the result is an uninterned string, so if you append string literals in a loop this will only happen the first time.)

Second, Python 2.x only does this for str, not unicode. Python 3.x does do this for Unicode strings. This is strange: it's a major performance difference--a difference in complexity. This discourages using Unicode strings in 2.x, when they should be encouraging it to help the transition to 3.x.

And finally, there can be no other references to the string.

>>> a = str(12345)
>>> id(a)
3082418720
>>> a += str(67890)
>>> id(a)
3082418720

This explains why the non-Unicode version is so much faster in your test than the Unicode version.

The actual code for this is string_concatenate in Python/ceval.c, and works for both s1 = s1 + s2 and s1 += s2. The function _PyString_Resize in Objects/stringobject.c also says explicitly: The following function breaks the notion that strings are immutable. See also http://bugs.python.org/issue980695.

Glenn Maynard
Thanks. It's the answer I'm looking for.
Cuper Hector