From http://jaynes.colorado.edu/PythonIdioms.html

"Build strings as a list and use ''.join at the end. join is a string method called on the separator, not the list. Calling it from the empty string concatenates the pieces with no separator, which is a Python quirk and rather surprising at first. This is important: string building with + is quadratic time instead of linear! If you learn one idiom, learn this one.

Wrong: for s in strings: result += s

Right: result = ''.join(strings)"

I'm not sure why this is true. If I have some strings I want to join, it isn't intuitively better to me to put them in a list and then call ''.join. Doesn't putting them into a list create some overhead? To clarify...

Python Command Line:

>>> str1 = 'Not'
>>> str2 = 'Cool'
>>> str3 = ''.join([str1, ' ', str2]) #The more efficient way **A**
>>> print str3
Not Cool
>>> str3 = str1 + ' ' + str2 #The bad way **B**
>>> print str3
Not Cool

Is A really linear time and B quadratic time?

A: 

It's not clear that you mean the same thing as the quoted advice does. This optimization matters when you have many strings, say M strings of length N each. Then

A

x = ''.join(strings) # Takes M*N operations 

B

x = ''
for s in strings:
    x = x + s  # Takes N + 2*N + ... + M*N operations

Unless optimized away by the implementation, yes: A is linear in the total length T = M*N, while B takes about M*T/2 = T*T / (2*N) operations, which is never better and grows quadratically with the number of strings M.

As for why join is actually quite intuitive: when you say "I have some strings", this can typically be formalized by saying that you have an iterable that yields strings. That is exactly what you pass to "string".join()
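To illustrate the point about iterables, here is a small sketch showing that join takes a list, a tuple, or a generator alike. The generator name `words` and the sample strings are made up for this example.

```python
def words():
    # Hypothetical generator standing in for "I have some strings".
    yield 'Not'
    yield ' '
    yield 'Cool'

# join only needs something it can iterate over to get strings.
from_list = ''.join(['Not', ' ', 'Cool'])
from_tuple = ''.join(('Not', ' ', 'Cool'))
from_generator = ''.join(words())

print(from_list == from_tuple == from_generator)
```

All three calls build the same string, so there is no need to construct a list first if the strings already come from some other iterable.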

ilya n.
+4  A: 

Repeated concatenation is quadratic because it's Schlemiel the Painter's Algorithm (beware that some implementations will optimize this away so that it is not quadratic). join avoids this because it takes the entire list of strings, allocates the necessary space and does the concatenation in one pass.

Jason
+7  A: 

Yes. For the examples you chose, the importance isn't clear because you only have two very short strings, so the append would probably be faster.

But every time you do a + b with strings in Python, it causes a new allocation and then copies all the bytes from a and b into the new string. If you do this in a loop with lots of strings, these bytes have to be copied again and again, and each time the amount that has to be copied gets longer. This gives the quadratic behaviour.

On the other hand, creating a list of strings doesn't copy the contents of the strings - it just copies the references. This is incredibly fast, and runs in linear time. The join method then makes just one memory allocation and copies each string into the correct position only once. This also takes only linear time.
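The "one allocation, one copy per string" idea can be sketched in plain Python. This is only an illustration of the strategy described above, not the interpreter's actual implementation of join. The function name `join_bytes` is hypothetical, and it works on byte strings so that a preallocated `bytearray` can stand in for the single output buffer.

```python
def join_bytes(pieces):
    """Concatenate byte strings using one buffer and one copy per piece."""
    pieces = list(pieces)                # copies references only, not contents
    total = sum(len(p) for p in pieces)  # first pass: measure the final size
    buf = bytearray(total)               # single allocation of that size
    pos = 0
    for p in pieces:                     # second pass: copy each piece once
        buf[pos:pos + len(p)] = p
        pos += len(p)
    # Converting to an immutable result costs one more linear copy here;
    # this step is an artifact of the sketch.
    return bytes(buf)

result = join_bytes([b'Not', b' ', b'Cool'])
print(result == b''.join([b'Not', b' ', b'Cool']))
```

Both passes touch each piece a constant number of times, so the total work grows linearly with the combined length of the inputs.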

So yes, do use the ''.join idiom if you are potentially dealing with a large number of strings. For just two strings it doesn't matter.

If you need more convincing, try it for yourself by creating a string from 10M characters:

>>> chars = ['a'] * 10000000
>>> r = ''
>>> for c in chars: r += c
>>> print len(r)

Compared with:

>>> chars = ['a'] * 10000000
>>> r = ''.join(chars)
>>> print len(r)

The first method takes about 10 seconds. The second takes under 1 second.
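For a more repeatable comparison, the two approaches can be wrapped in functions and timed with the standard timeit module. This is a rough sketch: the function names `concat_loop` and `concat_join` are made up here, the input is smaller than the 10M characters above so that it finishes quickly, and the numbers will vary by machine and interpreter. CPython in particular can often extend the string in place for `r += c`, which may hide most of the quadratic cost in a simple loop like this one.

```python
import timeit

def concat_loop(chars):
    # Repeated concatenation: builds the result one piece at a time.
    r = ''
    for c in chars:
        r += c
    return r

def concat_join(chars):
    # The join idiom: hand all pieces to join at once.
    return ''.join(chars)

chars = ['a'] * 100000  # assumed size, chosen only to keep the run short

loop_time = timeit.timeit(lambda: concat_loop(chars), number=5)
join_time = timeit.timeit(lambda: concat_join(chars), number=5)
print('loop: %.4f s, join: %.4f s' % (loop_time, join_time))
```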

Mark Byers
+3  A: 

When you code s1 + s2, Python needs to allocate a new string object, copy all characters of s1 into it, then after that all characters of s2. This trivial operation does not bear quadratic time costs: the cost is O(len(s1) + len(s2)) (plus a constant for allocation, but that doesn't figure in big-O;-).

However, consider the code in the quote you're giving: for s in strings: result += s.

Here, every time a new s is added, all the previous ones have to be first copied into the newly allocated space for result (strings are immutable, so the new allocation and copy must take place). Suppose you have N strings of length L: you'll copy L characters the first time, then 2 * L the second time, then 3 * L the third time... in all, that makes it L * N * (N+1) / 2 characters getting copied... so, yep, it's quadratic in N.
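The arithmetic above can be checked with a small simulation that counts copied characters instead of building real strings. It assumes, as the explanation does, that every `result += s` allocates a new string and copies both operands, with no in-place optimization. The function names are hypothetical.

```python
def chars_copied_by_loop(n, length):
    """Characters copied by `result += s` over n strings of `length` chars."""
    copied = 0
    result_len = 0
    for _ in range(n):
        result_len += length  # the new result holds the old result plus s
        copied += result_len  # all of it is copied into the new allocation
    return copied

def chars_copied_by_join(n, length):
    """Characters copied when each string is copied into place exactly once."""
    return n * length

n, length = 1000, 10
loop_copies = chars_copied_by_loop(n, length)
join_copies = chars_copied_by_join(n, length)

# The loop total matches the closed form L * N * (N+1) / 2.
print(loop_copies == length * n * (n + 1) // 2)
print(loop_copies, join_copies)
```

Doubling n roughly quadruples the loop count while only doubling the join count, which is the quadratic versus linear difference.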

In some other cases, a quadratic algorithm may be faster than a linear one for small-enough values of N (because the multipliers and constant fixed costs may be much smaller); but that's not the case here, because allocations are costly (both directly, and indirectly because of the likelihood of fragmenting memory). In comparison, the overhead of accumulating the strings into a list is essentially negligible.

Alex Martelli
+1  A: 

Joel writes about this in Back to Basics.

Dustin Getz