views:

628

answers:

3

Hello,

I generate a list of one dimensional numpy arrays in a loop and later convert this list to a 2d numpy array. I would've preallocated a 2d numpy array if i knew the number of items ahead of time, but I don't, therefore I put everything in a list.

The mock up is below:

>>> list_of_arrays = map(lambda x: x*ones(2), range(5))
>>> list_of_arrays
[array([ 0.,  0.]), array([ 1.,  1.]), array([ 2.,  2.]), array([ 3.,  3.]), array([ 4.,  4.])]
>>> arr = array(list_of_arrays)
>>> arr
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])

My question is the following:

Is there a better way (performancewise) to go about the task of collecting sequential numerical data (in my case numpy arrays) than putting them in a list and then making a numpy.array out of it (I am creating a new obj and copying the data)? Is there an "expandable" matrix data structure available in a well tested module?

A typical size of my 2d matrix would be between 100x10 and 5000x10 floats

EDIT: In this example i'm using map, but in my actual application I have a for loop

+2  A: 

What you are doing is the standard way. A property of numpy arrays is that they need contiguous memory. The only possibility of "holes" that I can think of is possible with the strides member of PyArrayObject, but that doesn't affect the discussion here. Since numpy arrays have contiguous memory and are "preallocated", adding a new row/column means allocating new memory, copying data, and then freeing the old memory. If you do that a lot, it is not very efficient.

One case where someone might not want to create a list and then convert it to a numpy array in the end is when the list contains a lot of numbers: a numpy array of numbers takes much less space than a native Python list of numbers (since the native Python list stores Python objects). For your typical array sizes, I don't think that is an issue.

When you create your final array from a list of arrays, you are copying all the data to a new location for the new (2-d in your example) array. This is still much more efficient than having a numpy array and doing next = numpy.vstack((next, new_row)) every time you get new data. vstack() will copy all the data for every "row".

There was a thread on numpy-discussion mailing list some time ago which discussed the possibility of adding a new numpy array type that allows efficient extending/appending. It seems there was significant interest in this at that time, although I don't know if something came out of it. You might want to look at that thread.

I would say that what you're doing is very Pythonic, and efficient, so unless you really need something else (more space efficiency, maybe?), you should be okay. That is how I create my numpy arrays when I don't know the number of elements in the array in the beginning.

Alok
@Alok---thanks for the thoughtful answer. The timings in the answer from ~unubuntu show a worry about 5% efficiency. This is almost certainly a mistake until you get to the point that you absolutely must have that 5%.
telliott99
+3  A: 

Suppose you know that the final array arr will never be larger than 5000x10. Then you could pre-allocate an array of maximum size, populate it with data as you go through the loop, and then use arr.resize to cut it down to the discovered size after exiting the loop.

The tests below suggest doing so will be slightly faster than constructing intermediate python lists no matter what the ultimate size of the array is.

Also, arr.resize de-allocates the unused memory, so the final (though maybe not the intermediate) memory footprint is smaller than what is used by python_lists_to_array.

This shows numpy_all_the_way is faster:

% python -mtimeit -s"import test" "test.numpy_all_the_way(100)"
100 loops, best of 3: 1.78 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(1000)"
100 loops, best of 3: 18.1 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(5000)"
10 loops, best of 3: 90.4 msec per loop

% python -mtimeit -s"import test" "test.python_lists_to_array(100)"
1000 loops, best of 3: 1.97 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(1000)"
10 loops, best of 3: 20.3 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(5000)"
10 loops, best of 3: 101 msec per loop

This shows numpy_all_the_way uses less memory:

% test.py
Initial memory usage: 19788
After python_lists_to_array: 20976
After numpy_all_the_way: 20348

test.py:

#!/usr/bin/env python
import numpy as np
import os

def memory_usage():
    pid=os.getpid()
    return next(line for line in open('/proc/%s/status'%pid).read().splitlines()
            if line.startswith('VmSize')).split()[-2]

N,M=5000,10

def python_lists_to_array(k):
    list_of_arrays = map(lambda x: x*np.ones(M), range(k))
    arr = np.array(list_of_arrays)
    return arr

def numpy_all_the_way(k):
    arr=np.empty((N,M))
    for x in range(k):
        arr[x]=x*np.ones(M)
    arr.resize((k,M))
    return arr

if __name__=='__main__':
    print('Initial memory usage: %s'%memory_usage())
    arr=python_lists_to_array(5000)
    print('After python_lists_to_array: %s'%memory_usage())    
    arr=numpy_all_the_way(5000)
    print('After numpy_all_the_way: %s'%memory_usage())    
unutbu
+1  A: 

I'll add my own version of ~unutbu's answer. Similar to numpy_all_the way, but you dynamically resize if you have an index error. I thought it would have been a little faster for small data sets, but it's a little slower - the bounds checking slows things down too much.

initial_guess = 1000

def my_numpy_all_the_way(k):
    arr=np.empty((initial_guess,M))
    for x,row in enumerate(make_test_data(k)):
        try:
            arr[x]=row
        except IndexError:
            arr.resize((arr.shape[0]*2, arr.shape[1]))
            arr[x]=row
    arr.resize((k,M))
    return arr
wisty