I have a Python module that uses a huge dictionary as a global variable. Currently the computation code sits at the top of the module, so every first import or reload of the module takes more than one minute, which is totally unacceptable. How can I save the computation result somewhere so that the next import/reload doesn't have to recompute it? I tried cPickle, but loading the dictionary from a file (1.3M) takes approximately as long as the computation.

To give more information about my problem:

FD = FreqDist(word for word in brown.words()) # this line of code takes 1 min
+2  A: 

I assume you've pasted the dict literal into the source, and that's what's taking a minute? I don't know how to get around that, but you could probably avoid instantiating this dict upon import... You could lazily-instantiate it the first time it's actually used.

David Grant
FD = FreqDist(word for word in brown.words()) # this line of code takes one minute
btw0
If FD is the same every time, paste the results of "print FD" right into your source file and skip the computation step.
David Grant
FD is a huge third-party class instance, and can't be written as a literal.
btw0
+1  A: 

You can use a shelve to store your data on disc instead of loading the whole data into memory. So startup time will be very fast, but the trade-off will be slower access time.

Shelve will pickle the dict values too, but will do the (un)pickle not at startup for all the items, but only at access time for each item itself.

Peter Hoffmann
+3  A: 

Calculate your global var on the first use.

class Proxy:
    @property
    def global_name(self):
        # calculate your global var here, enable cache if needed
        ...

_proxy_object = Proxy()
GLOBAL_NAME = _proxy_object.global_name

Or better yet, access the necessary data via a special data object.

class Data:
    GLOBAL_NAME = property(...)

data = Data()

Example:

from some_module import data

print(data.GLOBAL_NAME)

See Django settings.
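
One way the data object might be filled in, as a sketch only. The _cache dict and compute_count attribute are assumptions added for illustration, and sum(range(1000)) stands in for the real slow computation:

```python
class Data:
    """Holds expensive values and computes each one on first access."""

    def __init__(self):
        self._cache = {}
        self.compute_count = 0  # only here to show the work happens once

    @property
    def GLOBAL_NAME(self):
        if "GLOBAL_NAME" not in self._cache:
            self.compute_count += 1
            # Hypothetical stand-in for the slow computation.
            self._cache["GLOBAL_NAME"] = sum(range(1000))
        return self._cache["GLOBAL_NAME"]


data = Data()
```

Nothing is computed when data is created; the first read of data.GLOBAL_NAME does the work and later reads reuse the cached value.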

J.F. Sebastian
+2  A: 

You could try using the marshal module instead of the c?Pickle one; it could be faster. This module is used by python to store values in a binary format. Note especially the following paragraph, to see if marshal fits your needs:

Not all Python object types are supported; in general, only objects whose value is independent from a particular invocation of Python can be written and read by this module. The following types are supported: None, integers, long integers, floating point numbers, strings, Unicode objects, tuples, lists, sets, dictionaries, and code objects, where it should be understood that tuples, lists and dictionaries are only supported as long as the values contained therein are themselves supported; and recursive lists and dictionaries should not be written (they will cause infinite loops).

Just to be on the safe side, before unmarshalling the dict, make sure that the Python version that unmarshals the dict is the same as the one that did the marshal, since there are no guarantees for backwards compatibility.
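
A small sketch of that version check, written for Python 3 and assuming the data is a plain dict of supported types. The file name and the idea of storing the version tuple next to the data are illustrative choices, not something marshal requires:

```python
import marshal
import os
import sys
import tempfile

table = {"the": 3, "quick": 1, "fox": 2}  # illustrative data only

path = os.path.join(tempfile.mkdtemp(), "table.marshal")
current_version = tuple(sys.version_info[:2])

# Store the interpreter version next to the data so a mismatch is detectable.
with open(path, "wb") as f:
    marshal.dump((current_version, table), f)

with open(path, "rb") as f:
    saved_version, loaded = marshal.load(f)

if saved_version != current_version:
    raise RuntimeError("cache was written by a different Python version")
```

In a real module the dump would run once as a build step, and a version mismatch would more likely trigger a recomputation than an error.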

ΤΖΩΤΖΙΟΥ
A: 

Expanding on the delayed-calculation idea, why not turn the dict into a class that supplies (and caches) elements as necessary?
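
A sketch of that idea using a dict subclass with __missing__. It assumes each value can be computed independently per key, which may not hold for a frequency count over a whole corpus; LazyTable and slow_lookup are hypothetical names:

```python
class LazyTable(dict):
    """Dict that computes a value the first time its key is requested."""

    def __init__(self, compute):
        super().__init__()
        self._compute = compute

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        value = self._compute(key)
        self[key] = value
        return value


calls = []


def slow_lookup(word):
    """Hypothetical per-key computation; records each call."""
    calls.append(word)
    return len(word)


table = LazyTable(slow_lookup)
```

Reading table["brown"] twice calls slow_lookup only once, because the first result is stored in the dict itself.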

You might also use psyco to speed up overall execution...

HUAGHAGUAH
You might also check out pyrex/cython.
Jason Baker
psyco seems to only run on 32bit systems still
A: 

Or you could just use a database to store the values. Check out SQLObject, which makes it very easy to store stuff in a database.

Daren Thomas
+15  A: 

Just to clarify: the code in the body of a module is not executed every time the module is imported - it is run only once, after which future imports find the already created module, rather than recreating it. Take a look at sys.modules to see the list of cached modules.

However, if your problem is the time it takes for the first import after the program is run, you'll probably need to use some other method than a Python dict. Probably best would be to use an on-disk form, for instance a sqlite database or one of the dbm modules.
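
As a rough sketch of the sqlite route using the standard sqlite3 module. The table name freq, the column names, the temporary path and the lookup helper are all assumptions made for illustration:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "freq.sqlite")

# One-time build step: write the computed counts to disk.
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE freq (word TEXT PRIMARY KEY, n INTEGER)")
conn.executemany(
    "INSERT INTO freq VALUES (?, ?)",
    [("the", 3), ("quick", 1), ("fox", 2)],  # illustrative data only
)
conn.commit()
conn.close()


def lookup(word, db_path=path):
    """Fetch one count from disk; returns 0 for unknown words."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT n FROM freq WHERE word = ?", (word,)
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else 0
```

A long-running program would normally keep one connection open instead of reconnecting on every lookup.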

For a minimal change in your interface, the shelve module may be your best option - this puts a pretty transparent interface over the dbm modules that makes them act like an arbitrary Python dict, allowing any picklable value to be stored. Here's an example:

# Create dict with a million items:
import shelve
d = shelve.open('path/to/my_persistant_dict')
d.update(('key%d' % x, x) for x in xrange(1000000))
d.close()

Then in the next process, use it. There should be no large delay, as a lookup only touches the on-disk entry for the requested key, so everything doesn't have to get loaded into memory:

>>> d = shelve.open('path/to/my_persistant_dict')
>>> print d['key99999']
99999

It's a bit slower than a real dict, and it will still take a long time to load if you do something that requires all the keys (eg. try to print it), but may solve your problem.

Brian
+1  A: 

A couple of things that will help speed up imports:

  1. Try running Python with the -OO flag. It strips assert statements and docstrings, which may reduce the import time of modules somewhat.
  2. Is there any reason why you couldn't break the dictionary up into smaller dictionaries in separate modules that can be loaded more quickly?
  3. As a last resort, you could do the calculations asynchronously so that they won't delay your program until it needs the results. Or maybe even put the dictionary in a separate process and pass data back and forth using IPC if you want to take advantage of multi-core architectures.
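
The asynchronous option could look roughly like this, using a background thread from concurrent.futures. build_table and get_table are hypothetical names. In CPython a thread will not make CPU-bound work finish sooner; it only lets the rest of the program start in the meantime:

```python
from concurrent.futures import ThreadPoolExecutor


def build_table():
    """Hypothetical stand-in for the expensive computation."""
    return {word: len(word) for word in ("the", "quick", "brown", "fox")}


# Start the work at import time without waiting for it.
_executor = ThreadPoolExecutor(max_workers=1)
_future = _executor.submit(build_table)
_executor.shutdown(wait=False)  # no more tasks; the submitted one still runs


def get_table():
    """Block only if the result is not ready yet, then return it."""
    return _future.result()
```

Callers use get_table() wherever they previously read the global directly.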

With that said, I agree that you shouldn't be experiencing any delay in importing modules after the first time you import it. Here are a couple of other general thoughts:

  1. Are you importing the module within a function? If so, this can lead to performance problems since it has to check and see if the module is loaded every time it hits the import statement.
  2. Is your program multi-threaded? I have seen occasions where executing code upon module import in a multi-threaded app can cause some wonkiness and application instability (most notably with the cgitb module).
  3. If this is a global variable, be aware that global variable lookup times can be significantly longer than local variable lookup times. In this case, you can achieve a significant performance improvement by binding the dictionary to a local variable if you're using it multiple times in the same context.
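
The local-binding pattern from the last point, as a tiny sketch. TABLE and total_count are illustrative names, and the size of any speedup depends on the Python version, so it is worth measuring with timeit before relying on it:

```python
TABLE = {"the": 3, "quick": 1, "fox": 2}  # illustrative global


def total_count(words):
    table = TABLE  # bind the global to a local name once
    total = 0
    for word in words:
        total += table.get(word, 0)  # local lookup inside the loop
    return total
```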

With that said, it's a tad bit difficult to give you any specific advice without a little bit more context. More specifically, where are you importing it? And what are the computations?

Jason Baker
+1  A: 
  1. Factor the computationally intensive part into a separate module. Then at least on reload, you won't have to wait.

  2. Try dumping the data structure using protocol 2. The command to try would be cPickle.dump(FD, f, protocol=2), where f is a file opened in binary mode. From the docstring for cPickle.Pickler:

    Protocol 0 is the
    only protocol that can be written to a file opened in text
    mode and read back successfully.  When using a protocol higher
    than 0, make sure the file is opened in binary mode, both when
    pickling and unpickling.
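
A sketch of the full round trip with binary-mode files. It is written for Python 3, where the module is named pickle rather than cPickle, and the small dict is only a stand-in for the real FD object:

```python
import os
import pickle  # Python 3 name; on Python 2 this would be cPickle
import tempfile

FD = {"the": 3, "quick": 1, "fox": 2}  # illustrative stand-in for the real object

path = os.path.join(tempfile.mkdtemp(), "fd.pickle")

with open(path, "wb") as f:  # binary mode is required for protocol > 0
    pickle.dump(FD, f, protocol=2)

with open(path, "rb") as f:
    restored = pickle.load(f)
```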
    
fivebells
It seems that having the code in a separate module doesn't help much: when this module is reloaded, the code in the separate module gets executed again.
btw0
+2  A: 

If the 'shelve' solution turns out to be too slow or fiddly, there are other possibilities:

saffsd
A: 

There's another pretty obvious solution for this problem. When code is reloaded the original scope is still available.

So... doing something like this will make sure this code is executed only once.

try:
    FD
except NameError:
    FD = FreqDist(word for word in brown.words())
+2  A: 

shelve gets really slow with large data sets. I've been using redis quite successfully, and wrote a FreqDist wrapper around it. It's very fast, and can be accessed concurrently.

Jacob
A: 

I'm going through this same issue. shelve, databases, etc. are all too slow for this type of problem. You'll need to take the hit once and insert the data into an in-memory key/value store like Redis. It will just live there in memory (warning: it could use a good amount of memory, so you may want a dedicated box). You'll never have to reload it, and key lookups are served straight from memory:

from redis import Redis  # redis-py client; Redis() connects to localhost by default

r = Redis()
r.set(key, word)

word = r.get(key)