I have a Python program that processes fairly large NumPy arrays (hundreds of megabytes), which are stored on disk in pickle files (one ~100MB array per file). When I want to run a query on the data, I load the entire array via pickle and then perform the query (so that from the perspective of the Python program the entire array is in memory, even if the OS is swapping it out). I did this mainly because I believed that vectorized operations on NumPy arrays would be substantially faster than iterating over each item with a for loop.

I'm running this on a web server which has memory limits that I quickly run up against. I have many different kinds of queries that I run on the data so writing "chunking" code which loads portions of the data from separate pickle files, processes them, and then proceeds to the next chunk would likely add a lot of complexity. It'd definitely be preferable to make this "chunking" transparent to any function that processes these large arrays.

It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?

+2  A: 

It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?

Yes, but not by keeping the arrays on disk in a single pickle -- the pickle protocol just isn't designed for "incremental deserialization".

You can write multiple pickles to the same open file, one after the other (use dump, not dumps), and then the "lazy evaluator for iteration" just needs to use pickle.load each time.

Example code (Python 3.1 -- in any Python 2.x you'll want cPickle instead of pickle, a -1 for the protocol, etc., of course;-):

>>> import pickle
>>> lol = [range(i) for i in range(5)]
>>> fp = open('/tmp/bah.dat', 'wb')
>>> for subl in lol: pickle.dump(subl, fp)
... 
>>> fp.close()
>>> fp = open('/tmp/bah.dat', 'rb')
>>> def lazy(fp):
...   while True:
...     try: yield pickle.load(fp)
...     except EOFError: break
... 
>>> list(lazy(fp))
[range(0, 0), range(0, 1), range(0, 2), range(0, 3), range(0, 4)]
>>> fp.close()
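
Applied to the arrays from the question, the same trick works chunk by chunk. A minimal sketch, assuming each array can be split into row-chunks (the chunk size and filename here are made up):

import pickle
import numpy as np

def write_chunks(arr, fp, chunk_rows=10000):
    # dump the big array as a sequence of row-chunks, one pickle per chunk
    for start in range(0, len(arr), chunk_rows):
        pickle.dump(arr[start:start + chunk_rows], fp)

def lazy_chunks(fp):
    # yield one chunk at a time; only one chunk is ever in memory
    while True:
        try:
            yield pickle.load(fp)
        except EOFError:
            break

with open('/tmp/big.dat', 'wb') as fp:
    write_chunks(np.arange(1000000, dtype='float64'), fp)

with open('/tmp/big.dat', 'rb') as fp:
    total = sum(chunk.sum() for chunk in lazy_chunks(fp))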
Alex Martelli
+6  A: 

PyTables is designed to solve this problem for you.
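
A minimal sketch of the idea using the modern PyTables API, assuming PyTables is installed (the file and node names here are made up):

import numpy as np
import tables

# one-time conversion: store the big array in an HDF5 file
big = np.random.rand(100000, 10)
with tables.open_file('arrays.h5', mode='w') as f:
    f.create_array(f.root, 'big', big)

# at query time, a slice reads only the requested rows from disk
with tables.open_file('arrays.h5', mode='r') as f:
    node = f.root.big
    first_block = node[:1000]   # a (1000, 10) NumPy array
    # process in blocks without ever holding the whole array in memory
    total = sum(node[i:i + 1000].sum() for i in range(0, node.nrows, 1000))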

unutbu
+3  A: 

NumPy's memory-mapped data structure (memmap) might be a good choice here.

It lets you access a NumPy array stored in a binary file on disk without loading the entire file into memory at once.

(Note: NumPy's memmap object is not the same as Python's built-in mmap -- in particular, NumPy's is array-like, while Python's is file-like.)

The constructor signature is:

A = NP.memmap(filename, dtype='uint8', mode='r+', offset=0, shape=None, order='C')

All of the arguments are straightforward (i.e., they have the same meaning as used elsewhere in NumPy) except for 'order', which refers to the memory layout of the ndarray. The default is 'C', and the only other option is 'F', for Fortran; as elsewhere, these two options represent row-major and column-major order, respectively.
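
For example, a quick illustration of the two layouts:

import numpy as NP

a = NP.array([[1, 2, 3],
              [4, 5, 6]])
print(NP.ravel(a, order='C'))   # row-major:    [1 2 3 4 5 6]
print(NP.ravel(a, order='F'))   # column-major: [1 4 2 5 3 6]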

The two methods are:

flush, which writes to disk any changes you make to the array; and

close, which (in older NumPy versions) flushed and released the memory map; in current NumPy the map is released automatically when the memmap object is deleted or garbage-collected, so an explicit del has the same effect.

Example use:

import numpy as NP
from tempfile import mkdtemp
import os.path as PH

# build some sample data: a 1000 x 10 array of float32
my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
my_data = NP.array(my_data, dtype="float32")

fname = PH.join(mkdtemp(), 'tempfile.dat')

# create the file-backed memmap array (note: shape must be a tuple)
mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=(1000, 10))

# now write the data to the memmap array:
mm_obj[:] = my_data[:]

# flush changes to disk, then release the map
mm_obj.flush()
del mm_obj

# reload the memmap:
mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))

# verify that it's there!:
print(mm_obj[:20, :])
doug
This is really handy if you don't want to go through the trouble of installing PyTables.
erich