In a recent question I asked about the fastest way to convert a large numpy array to a delimited string. I asked because I wanted to transmit that plain-text string (over HTTP, for instance) to clients written in other programming languages. A delimited string of numbers is something any client program can work with easily. However, it was suggested that, because string conversion is slow, it would be faster on the Python side to base64-encode the array's binary data and send that instead. This is indeed faster.

My question now is: (1) how can I make sure my encoded numpy array travels well to clients on different operating systems and different hardware, and (2) how do I decode the binary data on the client side?

For (1), my inclination is to do something like the following:

import base64
import numpy as np

x = np.arange(100, dtype=np.float64)
# tostring() is deprecated in favor of tobytes(); both return the array's raw bytes
base64.b64encode(x.tobytes())

Is there anything else I need to do?

For (2), I would be happy to have an example in any programming language, where the goal is to turn the numpy array of floats into a similar native data structure. Assume we have already done the base64 decoding and have a byte array, and that we also know the numpy dtype, dimensions, and any other metadata that will be needed.

Thanks.

A: 

The tostring method of numpy arrays (called tobytes in newer numpy versions) basically gives you a dump of the memory used by the array's data: not the Python object wrapper, just the data itself. This is similar to what the struct stdlib module produces. Base64-encoding that string and sending it across should be good enough, although you may also need to send the actual datatype used, as well as the dimensions if it's a multidimensional array, since you can't tell those from the data alone.
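To make the point about metadata concrete, here is a minimal sketch of one way to bundle the bytes with their dtype and shape. The JSON envelope, its field names (dtype, shape, data), and the helper name encode_array are assumptions made for illustration; numpy does not prescribe any of them. The sketch also pins the byte order to little-endian, an extra precaution for the different-hardware case that the approach above does not strictly require.

```python
import base64
import json

import numpy as np


def encode_array(arr):
    """Return a JSON string holding dtype, shape, and base64-encoded data.

    Hypothetical helper: the envelope layout is an assumption, not a standard.
    """
    # "<f8" means little-endian float64, so the bytes do not depend on the
    # sender's native byte order.
    le = np.asarray(arr, dtype="<f8")
    payload = {
        "dtype": "<f8",
        "shape": list(le.shape),
        # tobytes() emits the elements in C (row-major) order by default.
        "data": base64.b64encode(le.tobytes()).decode("ascii"),
    }
    return json.dumps(payload)


message = encode_array(np.arange(6, dtype=np.float64).reshape(2, 3))
```

A client that reads the dtype and shape fields knows how many bytes each element takes and how to lay the elements out again.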

On the other side, how to read the data depends a little on the language. Most languages have a way of addressing such a block of memory as a particular type of array. In C, for example, you could simply base64-decode the string, assign the resulting buffer to (in the case of your example) a double *, which is C's counterpart of float64 on common platforms, and index away. This gives you none of the built-in safeguards, functions, and other operations that numpy arrays have in Python, but that's because C is quite a different language in that respect.
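As a rough illustration of the decoding step for a client that has no numpy, the sketch below uses only the Python standard library. It assumes the sender used little-endian float64 in C (row-major) order; the helper names decode_float64 and reshape_2d are hypothetical. The same steps (base64-decode, reinterpret every 8 bytes as a double, then apply the known dimensions) carry over to other languages.

```python
import base64
import struct


def decode_float64(b64_text):
    """Turn base64 text into a flat list of Python floats."""
    raw = base64.b64decode(b64_text)
    if len(raw) % 8 != 0:
        raise ValueError("payload length is not a multiple of 8 bytes")
    count = len(raw) // 8
    # "<" = little-endian, "d" = 8-byte IEEE 754 double (numpy's float64).
    return list(struct.unpack("<%dd" % count, raw))


def reshape_2d(flat, rows, cols):
    """Split a flat row-major list into a list of rows."""
    if rows * cols != len(flat):
        raise ValueError("dimensions do not match the number of elements")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# Stand-in for data received from the server: six doubles, 0.0 through 5.0.
received = base64.b64encode(struct.pack("<6d", 0, 1, 2, 3, 4, 5))
flat = decode_float64(received)
grid = reshape_2d(flat, 2, 3)
```

Element (r, c) of a C-ordered array sits at flat index r * cols + c, which is what reshape_2d relies on.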

Thomas Wouters
A: 

I'd recommend using an existing data format for interchange of scientific data/arrays, such as NetCDF or HDF. In Python, you can use the PyNIO library, which has numpy bindings, and there are several libraries for other languages. Both formats are built for handling large data and take care of language and machine-representation problems. They also work well with message passing, for example in parallel computing, so I suspect they would cover your use case.

ars