tags:
views: 235
answers: 4

I have a program which needs to turn many large one-dimensional numpy arrays of floats into delimited strings. I am finding this operation quite slow relative to the mathematical operations in my program and am wondering if there is a way to speed it up. For example, consider the following loop, which takes 100,000 random numbers in a numpy array and joins the array into a comma-delimited string 100 times.

import numpy as np
x = np.random.randn(100000)
for i in range(100):
    ",".join(map(str, x))

This loop takes about 20 seconds to complete (total, not per cycle). In contrast, 100 cycles of something like elementwise multiplication (x*x) would take less than 1/10 of a second to complete. Clearly the string join operation creates a large performance bottleneck; in my actual application it will dominate total runtime. This makes me wonder: is there a faster way than ",".join(map(str, x))? Since map() is where almost all the processing time occurs, this comes down to the question of whether there is a faster way to convert a very large number of numbers to strings.

+2  A: 

Very good writeup on the performance of various string concatenation techniques in Python: http://www.skymind.com/~ocrow/python_string/

I'm a little surprised that some of the latter approaches perform as well as they do, but it looks like you can certainly find something there that will work better for you than what you're doing now.

sblom
Thanks sblom. Unfortunately my code is already essentially the same as the fastest solution mentioned. Perhaps there is just no way to get it to go faster.
Abiel
@Abiel If you really want it faster then you should look into using Cython.
Justin Peel
+2  A: 

I think you could experiment with numpy.savetxt passing a cStringIO.StringIO object as a fake file...
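
For instance, a minimal sketch of that idea (the single-row reshape and the "%f" format are my assumptions here; savetxt writes one delimiter-joined line per row of the array):

import numpy as np
import cStringIO

x = np.random.randn(100000)
buf = cStringIO.StringIO()
# write the array as a single row so savetxt emits one comma-delimited line
np.savetxt(buf, x.reshape(1, -1), fmt="%f", delimiter=",")
s = buf.getvalue().rstrip("\n")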

Or maybe using str(x) and replacing the whitespace with commas (edit: this won't work well because str abbreviates large arrays with an ellipsis :-s).

As the purpose of this was to send the array over the network, maybe there are better alternatives (more efficient in both CPU and bandwidth). The one I pointed out in a comment on another answer is to encode the binary representation of the array as a Base64 text block. The main inconvenience for this to be optimal is that the client reading the chunk of data should be able to do nasty things like reinterpret a byte array as a float array, and that's not usually allowed in type-safe languages; but it could be done quickly with a C library call (and most languages provide means to do this).

In case you cannot mess with bits, there's always the possibility of processing the numbers one by one to convert the decoded bytes to floats.

Oh, and watch out for the endianness of the machines when sending data through the network: convert to network order -> base64encode -> send | receive -> base64decode -> convert to host order
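
A rough sketch of that pipeline (the '>f8' big-endian double dtype is an assumption about the data, and the decoding step is shown in Python only for illustration, since the real client may be another language):

import base64
import numpy as np

x = np.random.randn(100000)
# convert to big-endian ("network order") 64-bit floats and Base64-encode the raw bytes
payload = base64.b64encode(x.astype('>f8').tostring())

# receiving side, shown in Python only for illustration; another language
# would Base64-decode and reinterpret the bytes as doubles in the same byte order
decoded = np.fromstring(base64.b64decode(payload), dtype='>f8')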

fortran
Thanks fortran. Unfortunately I'm still not able to get a speed improvement with either savetxt or with str(x). str(x) at first appears to be much faster, but this disappears once np.set_printoptions(threshold=100000) is set (see my comment on unutbu's answer).
Abiel
+1  A: 

numpy.savetxt is even slower than string.join. ndarray.tofile() doesn't seem to work with StringIO.

But I did find a faster method (at least for the OP's example, on Python 2.5 with an older version of numpy):

import numpy as np
x = np.random.randn(100000)
for i in range(100):
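    # build a 100,000-field ",%f" format string, drop the leading comma, and fill it in one % operation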
    (",%f"*100000)[1:] % tuple(x)

It looks like string formatting is faster than string join if you have a well-defined format such as in this particular case. But I wonder why the OP needs such a long string of floating-point numbers in memory.

Newer versions of numpy show no speed improvement.

Dingle
Dingle - For whatever reason I am not finding this to be faster than my original example of join and str. As to why I need these long strings, I have a server application that operates on numpy arrays and then distributes the results in plain-text strings so that a variety of clients (including non-Python clients) can consume the data (this includes sending data over HTTP to remote clients). If there is a better way to distribute the data I would be happy to use it, but remember that clients using any programming language and running on any operating system would need to be able to consume it.
Abiel
For that use, compressed binary data is better than plain text! :-) My HTTP knowledge is a little bit rusty now, but you can at least encode the raw floats in Base64 to get better bit-density than in decimal. Make sure that the marshalling scheme is the same on all platforms (check network and host byte order and IEEE 754 compatible representations). If there's no numpy method to do that, you could write your own routine in C and call it with `ctypes`.
fortran
Thanks fortran, this looks like it may be the answer. Certainly doing x.tostring() in numpy is very fast. I'm not very familiar with reading and writing binary data across different environments, but I will dig into this.
Abiel
@Abiel, timeit shows it is 20~30% faster. Not sure if fortran's suggestion will improve the speed if data size is not an issue here. What about JSON or XML? I thought binary data over the network is not safe to unpack.
Dingle
fortran - After looking at your suggestion a bit more, I'm confused about how in practice you would decode the data at the client side, given that the client will not necessarily be written in Python. For example, the client might be written in Visual Basic and be designed to drop numerical arrays into a spreadsheet. In this case I would need to know how to take a binary representation of a numpy array and translate it into something like a VB Variant. Thoughts?
Abiel
@Dingle it is faster because '%f' isn't writing out as many digits as str()
Justin Peel
@Justin, good point. I tested again with "%.12f" (should be equivalent to the default str behavior), still 10~15% faster.
Dingle
@Dingle It's about 15% slower than the OP's code on my computer. I'm using Python 2.6.4 with Numpy 1.4.1 and using timeit.
Justin Peel
@Justin, I am working on Python 2.5 (on Python 2.5.2, string formatting is way faster than the OP's version). I tested it on Python 2.6, and string formatting is slightly slower. Obviously Python 2.6 somehow optimizes the string concatenation. :)
Dingle
@Dingle Okay, that makes more sense then. It's good to know my sanity is still somewhat intact.
Justin Peel
@Abiel It's just a plain C array... I'll update my answer to clarify this, as I'm not getting notifications when new comments are added here :-)
fortran
A: 

Using imap from itertools instead of map in the OP's code is giving me about a 2-3% improvement, which isn't much, but it's something that might combine with other ideas to give more improvement.
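
For reference, a sketch of that substitution (itertools.imap is Python 2's lazy equivalent of map, so the join consumes values without building an intermediate list):

from itertools import imap
import numpy as np

x = np.random.randn(100000)
# imap yields str(value) lazily; join never materializes a full list of strings
s = ",".join(imap(str, x))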

Personally, I think that if you want much better than this, you will have to use something like Cython.

Justin Peel