ansaurus

Question

Performance comparison of immutable string concatenation between Java and Python

Answer 1

+1 A:

My guess is that Python just does a realloc on the string rather than create a new one with a copy of the old one. Since realloc takes no time when there is enough empty space following the allocation, it is very fast.

So how come Python can call realloc and Java can't? Python's garbage collector uses reference counting so it can tell that nobody else is using the string and it won't matter if the string changes. Java's garbage collector doesn't maintain reference counts so it can't tell whether any other reference to the string is extant, meaning it has no choice but to create a whole new copy of the string on every concatenation.

EDIT: Although I don't know that Python actually does call realloc on a concat, here's the comment for _PyString_Resize in stringobject.c indicating why it can:

       The following function breaks the notion that strings are immutable:
       it changes the size of a string.  We get away with this only if there
       is only one module referencing the object.  You can also think of it
       as creating a new string object and destroying the old one, only
       more efficiently.  In any case, don't use this if the string may
       already be known to some other part of the code...

Gabe 2010-10-10 16:55:24

Great explanation; I don't know enough about Python to know if it's true or not. But even if Java did know that "a" isn't being referenced by something else, it would have the JLS (http://java.sun.com/docs/books/jls/third_edition/html/expressions.html#39990) to contend with: *The result is a reference to a String object (**newly created**, unless the expression is a compile-time constant expression)...* So implementations couldn't optimize this even if they wanted to, it seems.

Mark Peters 2010-10-10 17:01:28

Python strings are immutable just like Java's, so I doubt python is doing a realloc().

GregS 2010-10-10 17:18:01

GregS: As I noted in the second paragraph, Python has a reference count so it can do a `realloc` because it can know that nobody else is referencing the string. See my edit for why it's possible.

Gabe 2010-10-10 17:30:35

Gabe: Thanks, I think I understand now. Make sense.

GregS 2010-10-10 17:49:52

@GregS is correct, no realloc is used. A new string is `malloc`'ed and the two strings are `memcpy`'ed into the new string. (at-a-glance of 2.4 source.)

zdav 2010-10-10 17:51:55

@zdav is mistaken, possibly because he's looking at ancient source code; see my answer for details.

Glenn Maynard 2010-10-10 18:45:58

Answer 2

+1 A:

I don't think your test means a lot, since Java and Python handle strings differently (I am no expert in Python but I do know my way in Java). StringBuilders/Buffers exists for a reason in Java. The language designers didn't do any kind of more efficient memory management/manipulation exactly for this reason: there are other tools than the "String" object to do this kind of manipulation and they expect you to use them when you code.

When you do things the way they are meant to be done in Java, you will be surprised how fast the platform is... But I have to admit that I have been pretty much impressed by the performance of some Python applications I have tried recently.

Eric-Karl 2010-10-10 17:21:25

I agree with you partly, but for a benchmark you trying to compare likes with likes. The fact that they handle strings differently is sort of the point of the benchmark. Why is python's seemingly significantly faster? If the OPs benchmark has a defect it is surely not that the two languages do things differently.

GregS 2010-10-10 17:34:45

@GregS: I agree a benchmark should compare likes with likes. Yet, I believe that those "likes" should be the best way to do things on said platform. And, in java, intensively manipulating strings should be done with the StringBuilders.Why is python's faster? I don't know but I admit it's impressive. I just think the benchmark here is a bit too far from real world application...

Eric-Karl 2010-10-11 01:00:56

Answer 3

A:

I do not know the answer for sure. But here are some thoughts. First, Java internally stores strings as char [] arrays containing the UTF-16 encoding of the string. This means that every character in the strings takes at least two bytes. So just in terms of raw storage, Java would have to copy around twice as much data as python strings. Python unicode strings are therefore the better test because they are similarly capable. Perhaps python stores unicode strings as UTF-8 encoded bytes. In that case, if all you are storing in these are ASCII characters, then again you'd have Java using twice as much space and therefore doing twice as much copying. To get a better comparison you should concatenate strings containing more interesting characters that require two or more bytes in their UTF-8 encoding.

GregS 2010-10-10 17:28:42

Python does no such optimization. From Python's unicodeobject.h: "Setting Py_UNICODE_WIDE enables UCS-4 storage. Otherwise, Unicode strings are stored as UCS-2 (with limited support for UTF-16)"

Gabe 2010-10-10 17:38:34

@Gabe: Thanks, that is good information. So perhaps any performance improvement is due entirely to considerations you outlined in your answer.

GregS 2010-10-10 17:51:26

Answer 4

A:

I ran Java code with a StringBuilder in place of a String and saw an average finish time of 10ms (high 34ms, low 5ms).

As for the Python code, using "Method 6" here (found to be the fastest method), I was able to achieve an average of 84ms (high 91ms, low 81ms) using unicode strings. Using non-unicode strings reduced these numbers by ~25ms.

As such, it can be said based on these highly unscientific tests that using the fastest available method for string concatenation, Java is roughly an order of magnitude faster than Python.

But I still <3 Python ;)

Matt 2010-10-10 17:47:19

His results are wrong, or at least out of date: method 1 is the fastest in 2.6 by a small margin. (It's not surprising if you havn't run it, since it doesn't even run as-is, as it uses a module named `timing` which he doesn't include, rather than using the standard module `timeit`.) Note also that the `ps_stats` call takes a long time (about 18ms here), and must be removed to have meaningful results.

Glenn Maynard 2010-10-10 18:58:11

Answer 5

+1 A:

@Gabe's answer is correct, but needs to be shown clearly rather than hypothesized.

CPython (and probably only CPython) does an in-place string append when it can. There are limitations on when it can do this.

First, it can't do it for interned strings. That's why you'll never see this if you test with a = "testing"; a = a + "testing", because assigning a string literal results in an interned string. You have to create the string dynamically, as this code does with str(12345). (This isn't much of a limitation; once you do an append this way once, the result is an uninterned string, so if you append string literals in a loop this will only happen the first time.)

Second, Python 2.x only does this for str, not unicode. Python 3.x does do this for Unicode strings. This is strange: it's a major performance difference--a difference in complexity. This discourages using Unicode strings in 2.x, when they should be encouraging it to help the transition to 3.x.

And finally, there can be no other references to the string.

>>> a = str(12345)
>>> id(a)
3082418720
>>> a += str(67890)
>>> id(a)
3082418720

This explains why the non-Unicode version is so much faster in your test than the Unicode version.

The actual code for this is string_concatenate in Python/ceval.c, and works for both s1 = s1 + s2 and s1 += s2. The function _PyString_Resize in Objects/stringobject.c also says explicitly: The following function breaks the notion that strings are immutable. See also http://bugs.python.org/issue980695.

Glenn Maynard 2010-10-10 18:36:47

Thanks. It's the answer I'm looking for.

Cuper Hector 2010-10-11 03:00:23

ansaurus

tags:

views:

answers:

Performance comparison of immutable string concatenation between Java and Python

related questions