I am building an application to distribute to fellow academics. The application will take three parameters that the user submits and output a list of dates and codes related to the events those parameters describe. I have been building this around a dictionary, with the intention of loading the dictionary from a pickle file when the application needs it. The parameters supplied by the user will be used to look up the needed output.

I selected this structure because I have gotten pretty comfortable with dictionaries and pickle files, and I see this going out the door with the smallest learning curve on my part. There might be as many as two million keys in the dictionary. I have been satisfied with the performance on my machine with a reasonable subset, and I have already thought about how to break the dictionary apart if there are performance concerns when the whole thing is put together. I am not really worried about the amount of disk space on their machines, as we are working with terabytes of storage.

Having said all of that, I have been poking around in the docs and am wondering whether I need to invest some time to learn and implement an alternative storage format. The only reason I can think of is if there is an alternative that could increase the lookup speed by a factor of three to five or more.

A: 

Here are three things you can try:

  1. Compress the pickled dictionary with zlib, e.g. zlib.compress(pickle.dumps(d)) (sketched below).
  2. Make your own serialization format (shouldn't be too hard).
  3. Load the data into a sqlite database (also sketched below).
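
Rough sketches of options 1 and 3, under assumptions of my own (the file names, keys, and values are stand-ins, not from the question). Option 1 just wraps the existing pickle round-trip in zlib:

import pickle
import zlib

# tiny stand-in for the real date/code dictionary, keyed on the three parameters
lookup = {('a', 'b', 'c'): [('2009-06-01', 'X41')],
          ('a', 'b', 'd'): [('2009-07-15', 'Q07')]}

# write: pickle, then compress the byte string before it hits disk
open('lookup.pkl.z', 'wb').write(
    zlib.compress(pickle.dumps(lookup, pickle.HIGHEST_PROTOCOL)))

# read: decompress, then unpickle
lookup = pickle.loads(zlib.decompress(open('lookup.pkl.z', 'rb').read()))
print lookup[('a', 'b', 'c')]

For option 3, the same data can go into a sqlite table keyed on the three parameters; sqlite keeps an index on the primary key, so single-row lookups stay fast even at two million rows (the table and column names here are invented for the example):

import sqlite3

conn = sqlite3.connect('lookup.db')
conn.execute('CREATE TABLE IF NOT EXISTS events '
             '(p1 TEXT, p2 TEXT, p3 TEXT, dates_and_codes TEXT, '
             'PRIMARY KEY (p1, p2, p3))')
conn.execute('INSERT OR REPLACE INTO events VALUES (?, ?, ?, ?)',
             ('a', 'b', 'c', '2009-06-01|X41'))
conn.commit()

row = conn.execute('SELECT dates_and_codes FROM events '
                   'WHERE p1=? AND p2=? AND p3=?', ('a', 'b', 'c')).fetchone()
print row[0]

Note that option 1 only shrinks the file on disk; the whole dictionary still has to be decompressed and unpickled into memory before the first lookup, whereas sqlite reads single rows on demand.
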
Unknown
+2  A: 

To get fast lookups, use the standard Python dbm module (see http://docs.python.org/library/dbm.html) to build your database file, and do lookups in it. The dbm file format may not be cross-platform, so you may want to distribute your data in pickle or repr or JSON or YAML or XML format, and build the dbm database the first time the user runs your program.
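
A rough sketch of that build-then-lookup flow, using Python 2's anydbm wrapper over the dbm family (the file names and the composite key format are assumptions for the example, not something from the question):

import anydbm
import json

# build step: run once on the user's machine from the distributed data file
source = json.load(open('events.json'))          # hypothetical shipped data file
db = anydbm.open('events.db', 'c')               # create the dbm file
for key, value in source.items():
    db[key.encode('utf-8')] = json.dumps(value)  # dbm keys and values must be strings
db.close()

# lookup step inside the application: open read-only and fetch by key
db = anydbm.open('events.db', 'r')
dates_and_codes = json.loads(db['2009|dept41|conference'])  # hypothetical composite key
db.close()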

pts
+6  A: 

The standard shelve module will give you a persistent dictionary stored in a dbm-style database. Provided that your keys are strings and your values are picklable (since you're using pickle already, this must be true), this could be a better solution than simply storing the entire dictionary in a single pickle.

Example:

>>> import shelve
>>> d = shelve.open('mydb')
>>> d['key1'] = 12345
>>> d['key2'] = ['any', 'picklable', 'value']
>>> print d['key1']
12345
>>> d.close()

I'd also recommend Durus, but that requires some extra learning on your part. It'll let you create a PersistentDictionary. From memory, keys can be any pickleable object.

mhawke
FYI: http://docs.python.org/library/shelve.html
+2  A: 

How much memory can your application reasonably use? Is this going to be running on each user's desktop, or will there just be one deployment somewhere?

A Python dictionary in memory can certainly cope with two million keys. You say that you've got a subset of the data; do you have the whole lot? Maybe you should throw the full dataset at it and see whether it copes.

I just tested creating a two million record dictionary; the total memory usage for the process came in at about 200MB. If speed is your primary concern and you've got the RAM to spare, you're probably not going to do better than an in-memory python dictionary.
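
If you want to reproduce that check on your own machine, a throwaway test along these lines (with synthetic keys and values roughly the shape of yours) shows both the memory footprint and the raw lookup speed:

import random
import time

# two million synthetic entries, roughly the shape of the real data
d = dict(('key%07d' % i, ('2009-06-01', 'CODE%d' % i)) for i in xrange(2000000))

# time 100,000 random lookups
keys = ['key%07d' % random.randrange(2000000) for _ in xrange(100000)]
start = time.time()
for k in keys:
    value = d[k]
print '100,000 lookups took %.3f seconds' % (time.time() - start)

Watch the process in your system monitor while it runs to see the total memory usage.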

John Fouhy
+1  A: 

See this solution at SourceForge, esp. the "endnotes" documentation:

y_serial.py module :: warehouse Python objects with SQLite

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."

http://yserial.sourceforge.net

code43