tags:
views: 235
answers: 4

I have a program which needs to turn many large one-dimensional numpy arrays of floats into delimited strings. I am finding this operation quite slow relative to the mathematical operations in my program and am wondering if there is a way to speed it up. For example, consider the following loop, which takes 100,000 random numbers in a numpy array and joins the array into a comma-delimited string 100 times.

import numpy as np
x = np.random.randn(100000)
for i in range(100):
    ",".join(map(str, x))

This loop takes about 20 seconds to complete (total, not per cycle). In contrast, 100 cycles of something like elementwise multiplication (x*x) would take less than 1/10 of a second to complete. Clearly the string join operation creates a large performance bottleneck; in my actual application it will dominate total runtime. This makes me wonder: is there a faster way than ",".join(map(str, x))? Since map() is where almost all the processing time occurs, this comes down to the question of whether there is a faster way to convert a very large number of numbers to strings.

+2  A: 

Very good writeup on the performance of various string concatenation techniques in Python: http://www.skymind.com/~ocrow/python_string/

I'm a little surprised that some of the latter approaches perform as well as they do, but it looks like you can certainly find something there that will work better for you than what you're doing now.

sblom
Thanks sblom. Unfortunately my code is already essentially the same as the fastest solution mentioned. Perhaps there is just no way to get it to go faster.
Abiel
@Abiel If you really want it faster then you should look into using Cython.
Justin Peel
+2  A: 

I think you could experiment with numpy.savetxt passing a cStringIO.StringIO object as a fake file...
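
For instance, a minimal sketch of that idea (the single-row reshape and the "%f" format are my assumptions here; savetxt writes one delimiter-joined line per row of the array):

import numpy as np
import cStringIO

x = np.random.randn(100000)
buf = cStringIO.StringIO()
# write the array as a single row so savetxt emits one comma-delimited line
np.savetxt(buf, x.reshape(1, -1), fmt="%f", delimiter=",")
s = buf.getvalue().rstrip("\n")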

Or maybe using str(x) and replacing the whitespace with commas (edit: this won't work well because str abbreviates large arrays with an ellipsis :-s).

As the purpose of this was to send the array over the network, maybe there are better alternatives (more efficient in both CPU and bandwidth). The one I pointed out in a comment on another answer is to encode the binary representation of the array as a Base64 text block. The main inconvenience for this to be optimal is that the client reading the chunk of data should be able to do nasty things like reinterpret a byte array as a float array, and that's not usually allowed in type-safe languages; but it could be done quickly with a C library call (and most languages provide means to do this).

In case you cannot mess with bits, there's always the possibility of processing the numbers one by one to convert the decoded bytes to floats.

Oh, and watch out for the endianness of the machines when sending data through the network: convert to network order -> base64encode -> send | receive -> base64decode -> convert to host order
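
A rough sketch of that pipeline (the '>f8' big-endian double dtype is an assumption about the data, and the decoding step is shown in Python only for illustration, since the real client may be another language):

import base64
import numpy as np

x = np.random.randn(100000)
# convert to big-endian ("network order") 64-bit floats and Base64-encode the raw bytes
payload = base64.b64encode(x.astype('>f8').tostring())

# receiving side, shown in Python only for illustration; another language
# would Base64-decode and reinterpret the bytes as doubles in the same byte order
decoded = np.fromstring(base64.b64decode(payload), dtype='>f8')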

fortran
Thanks fortran. Unfortunately I'm still not able to get a speed improvement with either savetxt or with str(x). str(x) at first appears to be much faster, but this disappears once np.set_printoptions(threshold=100000) is set (see my comment on unutbu's answer).
Abiel
+1  A: 

numpy.savetxt is even slower than string.join. ndarray.tofile() doesn't seem to work with StringIO.

But I did find a faster method (at least for the OP's example, on Python 2.5 with an older version of numpy):

import numpy as np
x = np.random.randn(100000)
for i in range(100):
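    # build a 100,000-field ",%f" format string, drop the leading comma, and fill it in one % operation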
    (",%f"*100000)[1:] % tuple(x)

It looks like string formatting is faster than string join if you have a well-defined format such as in this particular case. But I wonder why the OP needs such a long string of floating-point numbers in memory.

Newer versions of numpy show no speed improvement.

Dingle
Dingle - For whatever reason I am not finding this to be faster than my original example of join and str. As to why I need these long strings, I have a server application that operates on numpy arrays and then distributes the results in plain-text strings so that a variety of clients (including non-Python clients) can consume the data (this includes sending data over HTTP to remote clients). If there is a better way to distribute the data I would be happy to use it, but remember that clients using any programming language and running on any operating system would need to be able to consume it.
Abiel
For that use, compressed binary data is better than plain text! :-) My HTTP knowledge is a little bit rusty now, but you can at least encode the raw floats in Base64 to get better bit-density than in decimal. Make sure that the marshalling scheme is the same on all platforms (check network and host byte order and IEEE 754 compatible representations). If there's no numpy method to do that, you could write your own routine in C and call it with `ctypes`.
fortran
Thanks fortran, this looks like it may be the answer. Certainly doing x.tostring() in numpy is very fast. I'm not very familiar with reading and writing binary data across different environments, but I will dig into this.
Abiel
@Abiel, timeit shows it is 20~30% faster. Not sure if fortran's suggestion will improve the speed if data size is not an issue here. What about JSON or XML? I thought binary data over the network is not safe to unpack.
Dingle
fortran - After looking at your suggestion a bit more, I'm confused about how in practice you would decode the data at the client side, given that the client will not necessarily be written in Python. For example, the client might be written in Visual Basic and be designed to drop numerical arrays into a spreadsheet. In this case I would need to know how to take a binary representation of a numpy array and translate it into something like a VB Variant. Thoughts?
Abiel
@Dingle it is faster because '%f' isn't writing out as many digits as str()
Justin Peel
@Justin, good point. I tested again with "%.12f" (should be equivalent to the default str behavior), still 10~15% faster.
Dingle
@Dingle It's about 15% slower than the OP's code on my computer. I'm using Python 2.6.4 with Numpy 1.4.1 and using timeit.
Justin Peel
@Justin, I am working on Python 2.5 (on Python 2.5.2, string formatting is way faster than the OP's version). I tested it on Python 2.6, and string formatting is slightly slower. Obviously Python 2.6 somehow optimizes the string concatenation. :)
Dingle
@Dingle Okay, that makes more sense then. It's good to know my sanity is still somewhat intact.
Justin Peel
@Abiel It's just a plain C array... I'll update my answer to clarify this, as I'm not getting notifications when new comments are added here :-)
fortran
A: 

Using imap from itertools instead of map in the OP's code is giving me about a 2-3% improvement, which isn't much, but it's something that might combine with other ideas to give more improvement.
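
For reference, a sketch of that substitution (itertools.imap is Python 2's lazy equivalent of map, so the join consumes values without building an intermediate list):

from itertools import imap
import numpy as np

x = np.random.randn(100000)
# imap yields str(value) lazily; join never materializes a full list of strings
s = ",".join(imap(str, x))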

Personally, I think that if you want much better than this, you will have to use something like Cython.

Justin Peel