views:

208

answers:

2

I have some data stored in a DB that I want to process. DB access is painfully slow, so I decided to load all data in a dictionary before any processing. However, due to the huge size of the data stored, I get an out of memory error (I see more than 2 gigs being used). So I decided to use a disk data structure, and found out that using shelve is an option. Here's what I do (pseudo python code)

def loadData():
    if (#dict exists on disk):
        d = shelve.open(name)
        return d
    else:
        d = shelve.open(name, writeback=True)
        #access DB and write data to dict
        # d[key] = value 
        # or for mutable values
        # oldValue = d[key]
        # newValue = f(oldValue)
        # d[key] = newValue 
        d.close()
        d = shelve.open(name, writeback=True)
        return d

I have a couple of questions,

1) Do I really need the writeBack=True? What does it do?

2) I still get an OutofMemory exception, since I do not exercise any control over when the data is being written to disk. How do I do that? I tried doing a sync() every few iterations but that didn't help either.

Thanks!

A: 

Using the sqlite3 module is probably your best choice here. You might be able to use sqlite entirely in memory anyway since its memory footprint might be a bit smaller than using python objects anyway. It's generally a better choice than using shelve anyway; shelve uses pickle underneath, which is rarely what you want.

Hell, you could just convert your entire existing database to a sqlite database. sqlite is nice and fast.

Aaron Gallagher
+1  A: 

writeback=True forces the shelf to keep in-memory any item ever fetched, and write them back when the shelf is closed. So, it consumes much more memory, and slows down closing.

The advantage of the parameter is that, with it, you don't need the contorted code you show in your comment for mutable items whose mutator is a method -- just

shelf['foobar'].append(23)

works (if shelf was opened with writeback enabled), assuming the item at key 'foobar' is a list of course, while it would silently be a no-operation (leaving the item on disk unchanged) if shelf was opened without writeback -- in the latter case you actually do need to code

thelist = shelf['foobar']
thelist.append(23)
shekf['foobar'] = thelist

in your comment's spirit -- which is stylistically somewhat of a bummer.

However, since you are having memory problems, I definitely recommend not using this dubious writeback option. I think I can call it "dubious" since I was the one proposing and first implementing it, but that was many years ago, and I've mostly repented of doing it -- it generales more confusion (as your Q evidences) than it allows elegance and handiness in moving code originally written to work with dicts (which would use the first idiom, not the second, and thus need rewriting in order to be usable with shelves without traceback). Ah well, sorry, it did seem a good idea at the time.

Alex Martelli
What about `writeback=True` combined with `sync`?
Steven
@Steven, each `sync` empties the shelf's cache, so if you do it often enough you may ameliorate memory pressure (overall speed issues will be as bad, or worse, but spread out over all `sync` calls rather than lumped in the `close` at the end -- this spreading effect may be good or bad, depending).
Alex Martelli
@Alex, what is the caching strategy when the shelf is opened with writeback set to False?
fsm
Alex Martelli