Hello,

I was running some dynamic programming code (trying to brute-force disprove the Collatz conjecture =P) and I was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.

I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.

Is the smartest approach just to create my own set of dicts, keep only one in memory at a time, and page them out in some efficient manner?

+8  A: 

Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.
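A minimal sketch of that approach, using the standard library's anydbm module (which picks a Berkeley-DB-style backend where available) with a naive dict cache in front; the names here are just illustrative:

import anydbm

db = anydbm.open('chains.db', 'c')  # 'c' creates the file if it doesn't exist
cache = {}  # naive, unbounded front cache; a real one needs an eviction policy

def get_chainlen(indx):
    key = str(indx)  # dbm keys and values must be strings
    if key not in cache:
        cache[key] = int(db[key])
    return cache[key]

def set_chainlen(indx, chainlen):
    key = str(indx)
    cache[key] = chainlen
    db[key] = str(chainlen)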

Parand
+3  A: 

The last time I faced a problem like this, I rewrote my code to use SQLite rather than a dict and got a massive performance increase. That increase was at least partially due to the database's indexing capabilities; depending on your algorithms, YMMV.

A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.
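A minimal sketch of what such a wrapper might look like (the class and table names are illustrative, not from the original code):

import sqlite3

class SqliteDict(object):
    def __init__(self, path):
        # isolation_level=None gives autocommit; wrapping batches of writes
        # in explicit transactions would be considerably faster
        self.conn = sqlite3.connect(path, isolation_level=None)
        self.conn.execute('CREATE TABLE IF NOT EXISTS vals '
                          '(indx INTEGER PRIMARY KEY, chainlen INTEGER)')

    def __setitem__(self, indx, chainlen):
        self.conn.execute('INSERT OR REPLACE INTO vals VALUES (?, ?)',
                          (indx, chainlen))

    def __getitem__(self, indx):
        row = self.conn.execute('SELECT chainlen FROM vals WHERE indx = ?',
                                (indx,)).fetchone()
        if row is None:
            raise KeyError(indx)
        return row[0]

The PRIMARY KEY gives you an index on indx for free, and the ? placeholders are both safer than string formatting and let sqlite reuse the prepared statement.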

Charles Duffy
How exactly would you use sqlite's indexing? The way I did it here was to create a table like this: "cur.execute('create table vals (indx INTEGER, chainlen INTEGER)')", and then do "cur.execute('SELECT * from vals where indx=%d' % i)" for a lookup.
Claudiu
create table vals (indx INTEGER PRIMARY KEY, chainlen INTEGER)
Vinko Vrsalovic
@Claudiu - my program was such that I could implement some logic in the database layer, so I could let the DB do filtering and such; it was more than just a dumb store.
Charles Duffy
You could also use a cache-size pragma to tell sqlite to use more memory for its cache: http://sqlite.org/pragma.html
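For example (conn being your sqlite3 connection; the value is a page count, so tune it to your RAM):

conn.execute('PRAGMA cache_size = 100000')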
John Fouhy
A: 

You should fetch more than one item at a time if there's some heuristic for knowing which items are most likely to be retrieved next, and don't forget the indexes, as Charles mentions.
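As a sketch, if nearby keys tend to be needed together, a range query against the table from the comments above can pull a whole block into an in-memory cache in one round trip:

def prefetch(conn, cache, start, count):
    # pull a contiguous block of keys into the in-memory cache at once
    rows = conn.execute('SELECT indx, chainlen FROM vals '
                        'WHERE indx BETWEEN ? AND ?',
                        (start, start + count - 1))
    for indx, chainlen in rows:
        cache[indx] = chainlen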

Vinko Vrsalovic
+3  A: 

With a little bit of thought it seems like you could get the shelve module to do what you want.

Therms
+2  A: 

The shelve module may do it; at any rate, it should be simple to test. Instead of:

self.lengths = {}

do:

import shelve
self.lengths = shelve.open('lengths.shelf')

The only catch is that keys to shelves must be strings, so you'll have to replace

self.lengths[indx]

with

self.lengths[str(indx)]

(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)

There's no built-in caching in memory, but your operating system may do that for you anyway.

[Actually, that's not quite true: you can pass writeback=True when opening the shelf. The intent of this is to make sure that storing lists and other mutable objects in the shelf works correctly. But a side effect is that the whole dictionary is cached in memory, and since running out of memory is what caused your problem in the first place, it's probably not a good idea :-) ]
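For completeness, enabling it looks like:

self.lengths = shelve.open('lengths.shelf', writeback=True)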

John Fouhy
I have actually tried this, but it was way too slow... I think I need some kind of manual paging solution to get any reasonable speed.
Claudiu
+7  A: 

The third-party shove module is also worth a look. It's very similar to shelve in that it's a simple dict-like object; however, it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even threadsafe. It's a very handy module.

from shove import Shove

mem_store = Shove()
file_store = Shove('file://mystore')

file_store['key'] = 42  # any picklable value works
Matthew Trevor
A: 

I haven't tried it yet, but Hamster DB looks promising, and it has a Python interface.

bortzmeyer
A: 

Read GvR's answer to this question ;) Sorting a million 32-bit integers in 2MB of RAM using Python

slav0nic
A: 

I've read that you found shelve too slow and that you tried to hack your own dict on top of sqlite.

Somebody else has done this too:

http://sebsauvage.net/python/snyppets/index.html#dbdict

It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?

e-satis