I have a large object I'd like to serialize to disk. I'm finding marshal works quite well and is nice and fast.

Right now I'm building the entire object and then calling marshal.dump. I'd like to avoid holding the whole object in memory if possible; instead, I'd like to dump it incrementally as I build it. Is that possible?

The object is fairly simple: a dictionary of arrays.
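
For reference, what I'm doing today looks roughly like this (the file name and sample data are just illustrative):

import marshal

# build the whole dictionary of arrays in memory first...
data = {'key_a': [1, 2, 3], 'key_b': [4, 5, 6]}

# ...then write it out in one shot
with open('data.marshal', 'wb') as f:
    marshal.dump(data, f)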

A: 

This very much depends on how you are building the object. Is it an array of sub-objects? You could marshal/pickle each array element as you build it. Is it a dictionary? The same idea applies: marshal/pickle each key/value pair as it's added.

If it is just a big, complex, hairy object, you might want to marshal-dump each piece of the object, and then apply whatever your 'building' process is when you read it back in.
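
For example, here's a rough sketch of that idea for a dictionary of lists ('pieces.dat' and the sample pairs are made up): dump each key/value pair as its own marshal record, then rebuild the dictionary when reading back.

import marshal

# write: one marshal record per (key, value) pair, appended as the object is built
with open('pieces.dat', 'wb') as f:
    for key, value in [('a', [1, 2]), ('b', [3, 4])]:
        marshal.dump((key, value), f)

# read: rebuild the dictionary one record at a time
result = {}
with open('pieces.dat', 'rb') as f:
    while True:
        try:
            key, value = marshal.load(f)
        except EOFError:  # marshal.load raises EOFError at end of file
            break
        result[key] = value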

thebigjc
I updated the question. The object is pretty simple, a dictionary of arrays. Each array is quite small. Would I marshal each individual array? To the same file?
Parand
A: 

You should be able to dump the item piece by piece to the file. The two design questions that need settling are:

  1. How are you building the object as you put it into memory?
  2. How do you need your data when it comes back out?

If your build process populates the entire array associated with a given key in one go, you might just dump each key:array pair to the file as its own single-entry dictionary:

big_hairy_dictionary['sample_key'] = pre_existing_array
# marshal.dump needs an open file object (not a filename); append in binary mode
with open('central_file', 'ab') as f:
    marshal.dump({'sample_key': big_hairy_dictionary['sample_key']}, f)

Then on update, each call to marshal.load on the open file will return one of these small dictionaries, which you can use to update a central dictionary. But this is really only going to be helpful if, when you need the data back, you want to read 'central_file' through once per key.
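
Something like this would rebuild a central dictionary from that file (a sketch, assuming the 'central_file' written above):

import marshal

central = {}
with open('central_file', 'rb') as f:
    while True:
        try:
            piece = marshal.load(f)  # one single-entry dict per call
        except EOFError:             # raised when the file is exhausted
            break
        central.update(piece)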

Alternatively, if you are populating the arrays element by element in no particular order, you could try:

big_hairy_dictionary['sample_key'].append(single_element)
# one file per key; again, marshal.dump needs an open file object
with open('marshaled_files/' + 'sample_key', 'ab') as f:
    marshal.dump(single_element, f)

Then, when you load it back, you don't necessarily need to build the entire dictionary to get back what you need; you just keep calling marshal.load on the open 'marshaled_files/sample_key' file until it raises EOFError, and you have everything associated with that key.
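
A sketch of that read-back, assuming the per-key files written above:

import marshal

elements = []
with open('marshaled_files/sample_key', 'rb') as f:
    while True:
        try:
            elements.append(marshal.load(f))  # one element per record
        except EOFError:                      # end of this key's file
            break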

David Berger
+2  A: 

The bsddb module's 'hashopen' and 'btopen' functions provide a persistent dictionary-like interface. Perhaps you could use one of these, instead of a regular dictionary, to incrementally serialize the arrays to disk?

import bsddb
import marshal

# bsddb stores string keys and string values, so serialize each
# array with marshal.dumps before storing it
db = bsddb.hashopen('file.db')  # opens the file, creating it if needed
db['array1'] = marshal.dumps(array1)
db['array2'] = marshal.dumps(array2)
...
db.close()

To retrieve the arrays:

db = bsddb.hashopen('file.db')
array1 = marshal.loads(db['array1'])
...
elo80ka
+1  A: 

If all your object needs to be is a dictionary of lists, then the shelve module may do the job. It presents a dictionary-like interface where the keys and values are stored in a database file instead of in memory. One limitation which may or may not affect you is that keys in Shelf objects must be strings. Value storage will also be more compact if you specify protocol=-1 when creating the Shelf object, so that it uses the more efficient binary pickle representation.
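
A minimal sketch of that approach (the file name is illustrative):

import shelve

# protocol=-1 selects the newest, most compact binary pickle protocol
db = shelve.open('data.shelf', protocol=-1)

db['key1'] = [1, 2, 3]   # each assignment is written through to the file

# note: with the default writeback=False, mutating a value in place is
# not persisted; reassign the key to save the change
lst = db['key1']
lst.append(4)
db['key1'] = lst

db.close()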

Theran