views:

65

answers:

4

I am storing a table using Python, and I need persistence.

Essentially I am storing the table as a dictionary mapping strings to numbers, and the whole thing is stored with shelve:

self.DB = shelve.open("%s%sMoleculeLibrary.shelve" % (directory, os.sep), writeback=True)

I set writeback to True because I found the system tends to be unstable if I don't.

After the computations the system needs to close the database and write it back to disk. The database (the table) is now about 540 MB, and closing it takes ages. The time exploded once the table grew to about 500 MB, but I need a much bigger table; in fact I need two of them.
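A minimal sketch of the whole cycle (simplified, with a placeholder directory):

import os
import shelve

directory = "."  # placeholder path
db = shelve.open("%s%sMoleculeLibrary.shelve" % (directory, os.sep), writeback=True)
db["42_137"] = 3.14   # with writeback=True every accessed entry is also kept in an in-memory cache
# ... many more lookups and updates during the computation ...
db.close()            # close() re-pickles and writes back the entire cache, which is where the time goes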

I am probably using the wrong form of persistence. Any suggestions?

A: 

Have you tried pickle/cPickle or json?
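For the string -> number case, a json round-trip could look roughly like this (just a sketch; the file name is a placeholder):

import json

d = {'string1': 1, 'string2': 2, 'string3': 3}
with open('table.json', 'w') as f:   # placeholder file name
    json.dump(d, f)
with open('table.json') as f:
    d2 = json.load(f)                # note: the whole table is read back into memory at once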

msw
`shelve` uses `pickle` as its way of storing objects. If the OP's intent was to store arbitrary objects, I agree that pickle would be just about the only/best solution, but he specified that the dictionary is a simple string->number mapping, which lends itself more to JSON, as you said. See my answer for how that JSON could best be stored.
nearlymonolith
I have tried pickle, but as A. Morelli points out, shelve essentially uses pickle. In any case the problem is not that it does not work, but that it is too slow for big tables. I will now investigate json. Thanks
Pietro Speroni
+5  A: 

For storing a large dictionary of string : number key-value pairs, I'd suggest a JSON-native storage solution such as MongoDB. It has a wonderful Python API, PyMongo. MongoDB itself is lightweight and incredibly fast, and JSON objects map naturally to dictionaries in Python. This means that you can use your string key as the object ID, allowing for compressed storage and quick lookup.

As an example of how easy the code would be, see the following:

from pymongo import Connection

d = {'string1': 1, 'string2': 2, 'string3': 3}
conn = Connection()
db = conn['example-database']
collection = db['example-collection']
for string, num in d.items():
    collection.save({'_id': string, 'value': num})
# testing
newD = {}
for obj in collection.find():
    newD[obj['_id']] = obj['value']
print newD
# output is: {u'string2': 2, u'string3': 3, u'string1': 1}

You'd just have to convert back from unicode, which is trivial.
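For instance, one way to do it (just a sketch):

newD = dict((str(k), v) for k, v in newD.items())   # drop the u'' prefix from the keys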

nearlymonolith
Thank you. The data is actually a symmetric table number*number --> number, but since shelve wanted strings as keys I was somehow induced into writing it as a table string --> number, where the string is "a_b" with a and b numbers and a < b. I don't know json, pymongo or mongodb; I will study, test, and then see if it works. Also, the size of the table can be massive: 500 MB now is probably only about 1/10th of the total size (of this (!) experiment; others can be bigger), so I suspect the best result would be if it were possible to store everything on disk directly. Thanks again, Pietro
Pietro Speroni
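(A tiny sketch of the "a_b" encoding described in the comment above, purely for illustration; the helper name is made up:)

def make_key(a, b):
    a, b = sorted((a, b))        # symmetric table, so order the pair: a < b
    return "%d_%d" % (a, b)      # e.g. make_key(7, 3) -> "3_7"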
A: 

How much larger? What are the access patterns? What kinds of computation do you need to do on it?

Keep in mind that you are going to have some performance limits if you can't keep the table in memory no matter how you do it.

You may want to look at moving to SQLAlchemy, or at using something like bsddb directly, but both of those will sacrifice code simplicity. However, with SQL you may be able to offload some of the work to the database layer, depending on the workload.
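As a rough illustration of the SQL route (using the stdlib sqlite3 here rather than SQLAlchemy, purely as a sketch; the file and table names are made up):

import sqlite3

conn = sqlite3.connect('molecules.db')   # on-disk database, file name is a placeholder
conn.execute('CREATE TABLE IF NOT EXISTS vals (k TEXT PRIMARY KEY, v REAL)')
conn.execute('INSERT OR REPLACE INTO vals VALUES (?, ?)', ('3_7', 1.5))
conn.commit()
row = conn.execute('SELECT v FROM vals WHERE k = ?', ('3_7',)).fetchone()   # only the requested row is read from disk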

Walter Mundt
I am developing a theoretical algorithm, so the problem I am working on now probably has tables of a few gigabytes. But as people will use the algorithm for other problems (mainly in systems biology; think big, then increase), it is important to find a solution that can scale up. The access is random, and each term will be accessed only a few times. The only computation I need to do is to get the value, calculate it if it is not there, and store it. I was considering using MySQL so that the DB would not be in memory, but it would make the code more complex and slower. Thanks.
Pietro Speroni
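(Something like this lookup pattern, as a sketch; compute_value is a stand-in for the real calculation:)

def lookup(db, key):
    if key not in db:
        db[key] = compute_value(key)   # compute_value is hypothetical
    return db[key]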
A: 

Check out BerkeleyDB: it's free, fast, and has many features. And... it has Python bindings.
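A minimal sketch with the Python 2 stdlib bsddb binding (the file name is a placeholder; keys and values must be strings):

import bsddb

db = bsddb.btopen('MoleculeLibrary.bdb', 'c')   # on-disk B-tree
db['3_7'] = str(1.5)                            # bsddb stores strings, so numbers need converting
value = float(db['3_7'])
db.close()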

Odomontois