Is I/O more efficient, thanks to the Linux disk buffer cache, when frequently accessed Python objects are stored as separate cPickle files rather than all together in one large shelf?

Does the disk buffer cache operate differently in these two scenarios with respect to efficiency?

There may be thousands of large files (generally around 100 MB, sometimes 1 GB), but also plenty of RAM (e.g. 64 GB).

A:

I don't know of any theoretical way to decide which method is faster, and even if I did, I'm not sure I would trust it. So let's write some code and test it.

If we package our pickle/shelve managers in classes with a common interface, it will be easy to swap them in and out of your code. If at some future point you discover that one is better than the other (or discover some even better approach), all you have to do is write a class with the same interface; you'll then be able to plug it into your code with very little modification to anything else.

test.py:

import cPickle
import shelve
import os

class PickleManager(object):
    def store(self, name, value):
        # Binary mode is required for pickle protocols > 0
        with open(name, 'wb') as f:
            cPickle.dump(value, f)
    def load(self, name):
        with open(name, 'rb') as f:
            return cPickle.load(f)

class ShelveManager(object):
    def __init__(self, fname):
        self.fname = fname
    def __enter__(self):
        # The default 'c' flag opens the shelf, creating it if necessary
        self.shelf = shelve.open(self.fname)
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.shelf.close()
    def store(self, name, value):
        self.shelf[name] = value
    def load(self, name):
        return self.shelf[name]

def write(manager):
    for i in range(100):
        fname = '/tmp/{i}.dat'.format(i=i)
        data = 'The sky is so blue' * 100   # ~1.8 KB per entry
        manager.store(fname, data)

def read(manager):
    for i in range(100):
        fname = '/tmp/{i}.dat'.format(i=i)
        manager.load(fname)

Normally, you'd use PickleManager like this:

manager=PickleManager()
manager.load(...)
manager.store(...)

while you'd use the ShelveManager like this:

with ShelveManager('/tmp/shelve.dat') as manager:        
    manager.load(...)
    manager.store(...)
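Because both classes expose the same store/load interface, a third backend can be dropped in without touching the calling code. A minimal sketch of that idea (DictManager and process are hypothetical names, not part of test.py):

```python
# Any object exposing store()/load() satisfies the interface, so the
# application code never needs to know which backend it is using.
class DictManager(object):
    """In-memory stand-in with the same store/load interface."""
    def __init__(self):
        self._data = {}

    def store(self, name, value):
        self._data[name] = value

    def load(self, name):
        return self._data[name]

def process(manager):
    # Works identically with PickleManager, ShelveManager, or DictManager
    manager.store('result', [1, 2, 3])
    return manager.load('result')

print(process(DictManager()))  # -> [1, 2, 3]
```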

But to test performance, you could do something like this:

python -mtimeit -s'import test' 'with test.ShelveManager("/tmp/shelve.dat") as s: test.read(s)'
python -mtimeit -s'import test' 'test.read(test.PickleManager())'
python -mtimeit -s'import test' 'with test.ShelveManager("/tmp/shelve.dat") as s: test.write(s)'
python -mtimeit -s'import test' 'test.write(test.PickleManager())'
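The same kind of measurement can also be run from a script instead of the shell, using the timeit module directly. A self-contained sketch (in Python 3, cPickle was folded into pickle; the payload and repetition count here are assumptions, not the OP's data):

```python
import os
import pickle
import tempfile
import timeit

payload = 'The sky is so blue' * 100

def roundtrip():
    # Write and read back one pickle file, as PickleManager would
    fd, name = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as f:
        pickle.dump(payload, f)
    with open(name, 'rb') as f:
        result = pickle.load(f)
    os.unlink(name)
    return result

elapsed = timeit.timeit(roundtrip, number=100)
print('100 pickle round-trips: %.3f s' % elapsed)
```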

At least on my machine, the results came out like this:

                  read (ms)     write (ms)
PickleManager     9.26          7.92 
ShelveManager     5.32          30.9 

So it looks like ShelveManager may be faster at reading, but PickleManager may be faster at writing.

Be sure to run these tests yourself: timeit results can vary with the Python version, OS, filesystem type, and hardware, and also with whether the files are already warm in the disk buffer cache (a first, cold read will be slower than subsequent cached reads).

Also, note my write and read functions generate very small files. You'll want to test this on data more similar to your use case.
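One way to get closer to the file sizes described in the question is to scale the payload up. A sketch (the sizes here are assumptions, kept small for a quick smoke test; raise size_mb toward 100 to approximate the real files):

```python
import os

def make_payload(size_mb):
    # os.urandom gives incompressible bytes, so the on-disk file size
    # will match the in-memory size
    return os.urandom(size_mb * 1024 * 1024)

# 1 MB for a quick run; the question describes ~100 MB (sometimes 1 GB) files
data = make_payload(1)
print(len(data))  # -> 1048576
```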

unutbu
Nice example, thank you. I'll run it more extensively overnight on my test cases and report back.
ricopan