views: 3826
answers: 4

Is it worth my learning NumPy?

I have approximately 100 financial markets series, and I am going to create a cube array of 100x100x100 = 1 million cells. I will be running a 3-variable regression of each x against each y and z, to fill the array with standard errors.

I have heard that for "large matrices" I should use NumPy as opposed to Python lists, for performance and scalability reasons. Thing is, I know Python lists and they seem to work for me.

Is the scale of the above problem worth moving to NumPy?

What if I had 1000 series (i.e. 1 billion floating-point cells in the cube)?

Thanks.

+69  A: 

Numpy's arrays are more compact than Python lists -- a list of lists as you describe, in Python, would take at least 20MB or so, while a Numpy 3d array with single-precision floats in the cells would fit in 4MB. Access in reading and writing items is also faster with Numpy.

Maybe you don't care that much for just a million cells, but you definitely would for a billion cells -- neither approach would fit in a 32-bit architecture, but with 64-bit builds Numpy would get away with 4GB or so, Python alone would need at least about 12 GB (lots of pointers which double in size) -- a much costlier piece of HW!

The difference is mostly due to "indirectness" -- a Python list is an array of pointers to Python objects: at least 4 bytes per pointer, plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value -- and the memory allocator rounds up to 16). A Numpy array is an array of uniform values -- single-precision numbers take 4 bytes each, double-precision ones 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
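A quick back-of-the-envelope check of those numbers (a sketch assuming a 64-bit CPython; note that `sys.getsizeof` on a list measures only its pointer array, so the float objects must be counted separately):

```python
import sys
import numpy as np

n = 100

# A 100x100x100 cube of single-precision floats: exactly 4 bytes per cell.
cube = np.zeros((n, n, n), dtype=np.float32)
print(cube.nbytes)  # 4000000

# A flat Python list of a million float cells: one pointer per slot
# (8 bytes on 64-bit builds), plus a float object per distinct value.
pointer_bytes = sys.getsizeof([0.0] * n**3)   # the pointer array alone
float_bytes = sys.getsizeof(0.0) * n**3       # cost if every cell is distinct
print(pointer_bytes + float_bytes)            # several times cube.nbytes
```

The list total here is an estimate (a literal `[0.0] * n**3` actually shares one float object), but it reflects the real cost once every cell holds its own regression result.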

Alex Martelli
Alex - always the good answer. Thank you - point made. I'll go with Numpy for scalability and indeed for efficiency. I'm thinking I'll also soon be needing to learn parallel programming in Python, and invest in some OpenCL capable hardware ;)
Thomas Browne
Tx @Thomas, always glad to help!
Alex Martelli
It's super alex, here to answer NumPy questions in the blink of an eye :)
TM
+24  A: 

Numpy is not only more efficient, it is also more convenient. You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.

For example, you could read your cube directly from a file into an array:

import numpy
x = numpy.fromfile("data", dtype=float).reshape((100, 100, 100))
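For completeness, a round-trip sketch (the "data" filename is illustrative): `tofile` writes a raw, headerless binary dump, and `fromfile` reads it back given the matching dtype.

```python
import numpy as np

# Write a small cube as raw binary, then read it back.
cube = np.arange(2 * 2 * 2, dtype=float).reshape((2, 2, 2))
cube.tofile("data")

restored = np.fromfile("data", dtype=float).reshape((2, 2, 2))
print(np.array_equal(cube, restored))  # True
```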

Sum along the second dimension:

s = x.sum(axis=1)

Find which cells are above a threshold:

(x > 0.5).nonzero()

Take every second slice along the third dimension (keeping the even-indexed ones):

x[:, :, ::2]
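Put together as a runnable sketch (substituting random data for the file read):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 100, 100))   # stand-in for the cube read from disk

s = x.sum(axis=1)                 # sum along the second dimension
print(s.shape)                    # (100, 100)

above = (x > 0.5).nonzero()       # indices of cells above the threshold
print(len(above))                 # 3 index arrays, one per dimension

half = x[:, :, ::2]               # every second slice along the third axis
print(half.shape)                 # (100, 100, 50)
```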

Also, other libraries that could be useful to you, such as statistical analysis and visualization packages, work directly on numpy arrays.

Even if you don't have performance problems, learning numpy is worth the effort.

Roberto Bonvallet
Thanks - you have provided another good reason with your third example, as indeed, I will be searching the matrix for cells above a threshold. Moreover, I was loading up from SQLite. The file approach will be much more efficient.
Thomas Browne
+8  A: 

Note also that there is support for timeseries based on numpy in the timeseries scikits:

http://pytseries.sourceforge.net

For regression, I am pretty sure numpy will be an order of magnitude faster and more convenient than lists, even for the 100^3 problem.
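As a hypothetical sketch of one such 3-variable regression (names and data are illustrative, not from the question): `numpy.linalg.lstsq` fits the coefficients, and the standard errors the questioner wants to store come from the diagonal of sigma^2 (X'X)^-1. The loop over all (x, y, z) triples would wrap this.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 250                                   # observations per series

# One y series regressed on an intercept plus two regressors.
y = rng.random(n)
X = np.column_stack([np.ones(n), rng.random(n), rng.random(n)])

coef, resid, rank, _ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = resid[0] / (n - rank)            # residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
print(se.shape)                           # (3,): one standard error per coefficient
```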

David Cournapeau
+8  A: 

Alex mentioned memory efficiency, and Roberto mentioned convenience, and these are both good points. For a few more ideas, I'll mention speed and functionality.

Functionality: You get a lot built in with Numpy, FFTs, convolutions, fast searching, basic statistics, linear algebra, histograms, etc. And really, who can live without FFTs?

Speed: Here's a test of doing a sum over a list and a numpy array, showing that the sum over the numpy array is 10x faster (in this test -- mileage may vary).

from numpy import arange
from timeit import Timer

Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)
y = list(range(Nelements))

t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits) / Ntimeits,))
print("list:  %.3e" % (t_list.timeit(Ntimeits) / Ntimeits,))

which on my system (while I'm running a backup) gives:

numpy: 3.004e-05
list:  5.363e-04
tom10