After having successfully built a static data structure (see here), I want to avoid rebuilding it from scratch every time a user requests an operation on it. My naïve first idea was to dump the structure (using Python's pickle) into a file and load that file for each query. Needless to say (as I found out), this turns out to be too time-consuming, as the file is rather large.
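A minimal sketch of that per-query approach (with a hypothetical `build_structure` standing in for the real construction code):

```python
import pickle

def build_structure():
    # Placeholder for the actual (expensive) construction code.
    return {"root": list(range(1000))}

# Build once and dump to disk.
with open("structure.pkl", "wb") as f:
    pickle.dump(build_structure(), f)

# ...then, on every query, the whole file gets loaded again:
def handle_query(key):
    with open("structure.pkl", "rb") as f:
        structure = pickle.load(f)  # the slow part for large files
    return structure[key]

print(len(handle_query("root")))  # -> 1000
```

The `pickle.load` in `handle_query` is what dominates the per-request time once the file grows large.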

Any ideas how I can easily speed this up? Splitting the file into multiple files? Or a program running on the server? (How difficult would that be to implement?)

Thanks for your help!

+2  A: 

My suggestion would be not to rely on having an object structure. Instead, use a byte array (or an mmap'd file, etc.) that you can do random-access operations on, and implement the cross-referencing using pointers (offsets) inside that structure.

True, it will introduce the concept of pointers into your code, but it means you won't need to unpickle the structure each time a handler process starts up, and it will also use a lot less memory (as there won't be the overhead of Python objects).

As your database is going to be fixed during the lifetime of a handler process (I imagine), you won't need to worry about concurrent modifications, locking, etc.
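One way to sketch this idea, packing a small search tree into a `bytearray` with `struct` and following byte offsets instead of object references (a real implementation would `mmap` a file containing the same bytes rather than build the buffer in memory; the node layout here is just an illustrative choice):

```python
import struct

# Each node: value (int32), left child offset, right child offset (uint32; 0 = no child).
NODE = struct.Struct("<iII")

def pack_tree(node, buf):
    """Append node = (value, left_subtree, right_subtree) to buf; return its byte offset."""
    offset = len(buf)
    buf.extend(NODE.size * b"\x00")                    # reserve space for this node
    left = pack_tree(node[1], buf) if node[1] else 0
    right = pack_tree(node[2], buf) if node[2] else 0
    NODE.pack_into(buf, offset, node[0], left, right)
    return offset

tree = (8, (3, None, None), (10, None, (14, None, None)))
buf = bytearray()
root = pack_tree(tree, buf)

def search(buf, offset, value):
    """BST lookup that chases byte offsets instead of Python references."""
    while True:
        v, left, right = NODE.unpack_from(buf, offset)
        if value == v:
            return True
        offset = left if value < v else right
        if offset == 0:
            return False

print(search(buf, root, 14))  # -> True
print(search(buf, root, 7))   # -> False
```

Written out to a file once, the same `search` works unchanged on an `mmap.mmap` of that file, so a fresh handler process pays no unpickling cost at startup.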

Even if you did what you suggest, you shouldn't have to rebuild it on every user request; just keep an instance in memory in your worker process(es). That way the build cost only matters when a new worker process starts.

MarkR
Just to check whether I've understood this correctly: if I have a tree structure and want to access it using mmap, I create a file as follows: each line represents a node. The root is on the first line, together with its value (or a pointer to it) and a list of the line numbers of its children (and the children are defined analogously). Is this the idea?
jacob
OK, I implemented it... über-fast! Great! Thanks for the hint about pointers.
jacob
+4  A: 

You can dump it in a memory cache (such as memcached).

This method has the advantage of cache invalidation: when the underlying data changes, you can invalidate the cached data by key.

EDIT

Here's a Python client for memcached: python-memcached. Thanks, NicDumZ.
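The get-or-rebuild-and-invalidate pattern looks roughly like this. With a memcached daemon running you would use `memcache.Client(['127.0.0.1:11211'])`; the dict-backed stand-in below only exists so the sketch is self-contained, and `build_structure` is a placeholder:

```python
class FakeCache:
    """Dict-backed stand-in for memcache.Client, for illustration only."""
    def __init__(self):
        self._store = {}
    def set(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store.get(key)
    def delete(self, key):
        self._store.pop(key, None)

mc = FakeCache()  # with memcached: mc = memcache.Client(['127.0.0.1:11211'])

def build_structure():
    # Placeholder for the expensive construction code.
    return {"root": [1, 2, 3]}

def get_structure():
    structure = mc.get("structure")
    if structure is None:                 # cache miss: rebuild and store
        structure = build_structure()
        mc.set("structure", structure)
    return structure

s1 = get_structure()        # miss: builds and caches
s2 = get_structure()        # hit: served from the cache
mc.delete("structure")      # invalidate when the underlying data changes
```

Note that a real memcached client still serializes the stored object, so this mainly saves the rebuild cost, not the deserialization cost.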

muhuk
There is a Python implementation of the memcached client: http://www.tummy.com/Community/software/python-memcached/ In short, a daemon runs on your server, serving objects to your processes. Note that python-memcached relies on pickling, so you might want to try Alex's suggestion if using memcached alone does not help enough.
NicDumZ
Thanks for the hint about memcached. I would have accepted your answer as well, if one could accept two solutions...
jacob
+2  A: 

If you can rebuild your Python runtime with the patches offered by the Unladen Swallow project, you should see speedups of 40% to 150% in pickling and 36% to 56% in unpickling, according to their benchmarks; maybe that might help.

Alex Martelli
+2  A: 

The number one way to speed up your web application, especially when you have lots of mostly-static modules, classes, and objects that need to be initialized, is to serve requests in a way that lets a single interpreter handle multiple requests: mod_wsgi, mod_python, SCGI, FastCGI, Google App Engine, a standalone Python web server... basically anything except a standard CGI setup that starts a new Python process for every request. With this approach, you can make your data structure a global object that only needs to be read from its serialized format once per new process, which is much less frequent than once per request.
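A minimal WSGI sketch of that pattern, assuming a hypothetical `load_structure` in place of the real deserialization code:

```python
def load_structure():
    # In practice: unpickle from disk, build from a database, etc.
    return {"answer": "42"}

# Module-level: runs once per interpreter process, not once per request.
STRUCTURE = load_structure()

def application(environ, start_response):
    # Every request handled by this process reuses the same STRUCTURE object.
    body = STRUCTURE["answer"].encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```

Under mod_wsgi (or any long-lived WSGI server), `load_structure` runs when the process starts and `application` is then called for each request against the already-loaded structure.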

Miles
+1: A mod_wsgi daemon means you can keep the structure in memory without having to sweat the details of pickling it and recovering it. Replacing CGI with a mod_wsgi daemon is an immediate win.
S.Lott