ansaurus

Question

Lazy Evaluation for iterating through NumPy arrays

Answer 1

+2 A:

It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?

Yes, but not by keeping the arrays on disk in a single pickle -- the pickle protocol just isn't designed for "incremental deserialization".

You can write multiple pickles to the same open file, one after the other (use dump, not dumps), and then the "lazy evaluator for iteration" just needs to use pickle.load each time.

Example code (Python 3.1 -- in 2.any you'll want cPickle instead of pickle and a -1 for protocol, etc, of course;-):

>>> import pickle
>>> lol = [range(i) for i in range(5)]
>>> fp = open('/tmp/bah.dat', 'wb')
>>> for subl in lol: pickle.dump(subl, fp)
... 
>>> fp.close()
>>> fp = open('/tmp/bah.dat', 'rb')
>>> def lazy(fp):
...   while True:
...     try: yield pickle.load(fp)
...     except EOFError: break
... 
>>> list(lazy(fp))
[range(0, 0), range(0, 1), range(0, 2), range(0, 3), range(0, 4)]
>>> fp.close()
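
The same idea applied to the question's case might look like the sketch below. It is only an illustration under some assumptions: the helper names dump_blocks and lazy_values, the file name blocks.pkl, and the block size are all made up for this example, and the array is one-dimensional. Each pickle holds one block of the array, and the generator hands the values out one at a time.

```python
import os
import pickle
import tempfile

import numpy as np


def dump_blocks(arr, fp, block_size):
    """Write `arr` to the open binary file `fp` as consecutive pickles (hypothetical helper)."""
    for start in range(0, len(arr), block_size):
        pickle.dump(arr[start:start + block_size], fp,
                    protocol=pickle.HIGHEST_PROTOCOL)


def lazy_values(fp):
    """Yield values one by one; only one block is unpickled at a time."""
    while True:
        try:
            block = pickle.load(fp)
        except EOFError:
            return  # no more pickles in the file
        yield from block


data = np.arange(10)
path = os.path.join(tempfile.mkdtemp(), "blocks.pkl")  # illustrative file name

with open(path, "wb") as fp:
    dump_blocks(data, fp, block_size=4)

with open(path, "rb") as fp:
    values = [int(v) for v in lazy_values(fp)]

print(values)
```

Only one block lives in memory at any moment, so the block size sets the trade-off between memory use and the number of reads.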

Alex Martelli 2010-08-03 00:54:07

Answer 2

+6 A:

PyTables is designed to solve this problem for you.

unutbu 2010-08-03 01:25:56

Answer 3

+3 A:

NumPy's memory-mapped data structure (memmap) might be a good choice here.

You access your NumPy arrays from a binary file on disk, without loading the entire file into memory at once.

(Note: I believe, though I am not certain, that NumPy's memmap object is not the same as Python's mmap. In particular, NumPy's is array-like, whereas Python's is file-like.)

The method signature is:

A = NP.memmap(filename, dtype=uint8, mode='r+', offset=0, shape=None, order='C')

All of the arguments are straightforward (i.e., they have the same meaning as elsewhere in NumPy) except for 'order', which sets the memory layout of the ndarray. The default is 'C' (row-major order); the only other option is 'F', for Fortran (column-major order).
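
To make the layout difference concrete, here is a small sketch using ordinary in-memory arrays rather than a memmap (the variable names are just for illustration). The strides show how many bytes separate neighbouring elements along each axis.

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)

c_arr = np.ascontiguousarray(a)  # row-major ('C') layout
f_arr = np.asfortranarray(a)     # column-major ('F') layout

# Same values, different arrangement in memory.
c_strides = c_arr.strides  # bytes to step along (rows, columns)
f_strides = f_arr.strides

print(c_strides, f_strides)
```

In 'C' order the elements of a row sit next to each other, so stepping along a row moves 8 bytes; in 'F' order it is the elements of a column that are adjacent.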

The method to know is:

flush (which writes to disk any changes you make to the array).

Note that current NumPy releases have no close method on memmap; to close the file, flush and then delete the memmap object (del mm_obj).

example use:

import numpy as NP
from tempfile import mkdtemp
import os.path as PH

# 1000 x 10 array of random integers in [10, 100), stored as floats
my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
my_data = NP.array(my_data, dtype="float")

fname = PH.join(mkdtemp(), 'tempfile.dat')

# create the backing file; shape must be passed as a tuple
mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=(1000, 10))

# now write the data to the memmap array (values are cast to float32):
mm_obj[:] = my_data[:]

# push the changes to disk before reopening the file:
mm_obj.flush()

# reload the memmap:
mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))

# verify that it's there!:
print(mm_obj[:20,:])
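
For the lazy iteration the question asks about, a memmap can be wrapped in a generator. The following is a minimal sketch, not a definitive implementation: iter_rows, the chunk size, and the file name rows.dat are assumptions made for this example. It copies one chunk of rows into memory at a time and yields the rows individually.

```python
import os
import tempfile

import numpy as np


def iter_rows(path, dtype, shape, chunk_rows=256):
    """Yield rows one at a time, reading one chunk of rows per step (hypothetical helper)."""
    mm = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    for start in range(0, shape[0], chunk_rows):
        # Copy just this chunk from the mapped file into an ordinary array.
        chunk = np.array(mm[start:start + chunk_rows])
        for row in chunk:
            yield row


# Build a small test file to iterate over.
path = os.path.join(tempfile.mkdtemp(), "rows.dat")  # illustrative file name
shape = (1000, 10)
source = np.arange(shape[0] * shape[1], dtype="float32").reshape(shape)

writer = np.memmap(path, dtype="float32", mode="w+", shape=shape)
writer[:] = source
writer.flush()
del writer  # release the writable map before reading

row_sums = [float(row.sum()) for row in iter_rows(path, "float32", shape)]
print(len(row_sums), row_sums[0], row_sums[-1])
```

The query functions only see an ordinary iterator of rows, so they need no changes; the chunk size controls how much data is resident at once.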

doug 2010-08-03 01:38:45

This is really handy if you don't want to go through the trouble of installing PyTables.

erich 2010-08-05 19:30:46
