I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.

For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes infeasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.

I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
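For illustration, the round trip described above looks roughly like the sketch below. A plain dict stands in for memcached, and `save_state`, `load_state`, and the contents of `state` are hypothetical names, not the application's actual code.

```python
import pickle
import time

# Stand-in for memcached: a dict holding bytes. A real client would add
# network latency on top of the pickle cost measured here.
fake_cache = {}


def save_state(user_id, state):
    """Serialize the whole state object and store the bytes."""
    fake_cache[user_id] = pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL)


def load_state(user_id):
    """Fetch the bytes and rebuild the whole state object."""
    return pickle.loads(fake_cache[user_id])


def timed_round_trip(user_id, state):
    """Return the restored object and the seconds spent on save + load."""
    start = time.perf_counter()
    save_state(user_id, state)
    restored = load_state(user_id)
    return restored, time.perf_counter() - start


# Hypothetical state; the real object is application-specific.
state = {
    "weights": list(range(100_000)),
    "history": [("step", i) for i in range(1_000)],
}
restored, seconds = timed_round_trip("user-1", state)
print(f"round trip took {seconds:.4f}s")
```

Every request pays this full cost, even when only a small part of the state has changed.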

I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.

I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)

Notes:

  • I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
  • The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
+2  A: 

I think you can give ZODB a shot.

"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."

Initially it was an integral part of Zope, but lately a standalone package is also available.

It has the following limitation:

"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."

I have read about it but haven't tried it myself.

Another possibility is an in-memory SQLite database. Being in memory, it may speed up the process a bit, but you would still have to do the serialization and deserialization. Note: an in-memory database is expensive on resources.
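A minimal sketch of the in-memory SQLite idea, using only the standard library. The table name, column names, and helper functions are made up for illustration. Note that a `:memory:` database is private to the connection that created it, so it would not be visible to other server processes, and the pickle step is still there.

```python
import pickle
import sqlite3

# ":memory:" creates a database that lives only inside this connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_state (user_id TEXT PRIMARY KEY, payload BLOB)"
)


def save_state(user_id, state):
    """Pickle the state and upsert it; serialization is still required."""
    conn.execute(
        "INSERT OR REPLACE INTO user_state (user_id, payload) VALUES (?, ?)",
        (user_id, pickle.dumps(state)),
    )
    conn.commit()


def load_state(user_id):
    """Return the unpickled state, or None if the user has no row."""
    row = conn.execute(
        "SELECT payload FROM user_state WHERE user_id = ?", (user_id,)
    ).fetchone()
    return pickle.loads(row[0]) if row else None


save_state("user-1", {"step": 3, "choices": [1, 4, 2]})
print(load_state("user-1"))
```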

Here is a link: http://www.zope.org/Documentation/Articles/ZODB1

JV
This does more or less what I'm currently doing: serializing the object with pickle and saving the serialized data to the database. Since I'm trying to remove the overhead of the serialization and deserialization process, ZODB won't help.
taleinat
@taleinat, it only deserializes and serializes when the object goes out of cache and it is requested again.
Unknown
+7  A: 

Be cautious of premature optimization.

Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.

"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.

Advice. [Revised]

It appears you have something that works. Leverage that to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, try using session state.

Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.

Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) the measurements as fine-grained enough to be useful and stable enough to provide a meaningful comparison of alternatives.

For example, show the performance difference between persistent sessions and cached sessions.
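Assuming Django is the framework (the question does not say), switching between persistent and cached sessions is a settings change, roughly as in the fragment below. The memcached address is an assumption. Session data is still serialized by the session backend, so this removes database writes but not the serialization step.

```python
# settings.py (fragment)

# Cache-only sessions: no database write per request, but data is lost
# if the cache is flushed or evicts the entry.
SESSION_ENGINE = "django.contrib.sessions.backends.cache"

# Alternative for comparison: write-through cache backed by the database.
# SESSION_ENGINE = "django.contrib.sessions.backends.cached_db"

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheBackend",
        "LOCATION": "127.0.0.1:11211",  # assumed local memcached instance
    }
}
```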

S.Lott
I have it working, currently live and with active users. I've fiddled all right. And I've done the measuring and testing to make sure this is the actual bottleneck. I'm sure there's plenty more fiddling to be done, but I wanted to ask if there's a better way.
taleinat
@taleinat: Please update the question with these facts. Also, identify your framework, your OS, your database.
S.Lott
@S.Lott: I purposely haven't mentioned my framework, OS etc. because those can change, and I want to allow solutions with whatever tools fit the job.
taleinat
As for sessions, usually these are stored in Cookies and/or cache and/or database. Using any of these requires serializing my data, which is what I'm trying to avoid.
taleinat
@taleinat: sessions are stored in Cookies: false. Cache and file system are more common. A filesystem session cache (as in Django) might do what you want without you having to do any actual programming.
S.Lott
@S.Lott: You're right, often they are also saved in files. But, again, saving to files requires serializing the data first, and then deserializing it when loading from the file. I don't mind having to program for a solution; I'm looking for new conceptual approaches which can solve my problem.
taleinat
@taleinat: I've been trying to give you a conceptual approach which will solve your problem. Your "state" is just a session, nothing more. Using session state without persistence is the fastest thing there is.
S.Lott
@S.Lott: I appreciate that you're trying to help. Yes, non-persisting state is very fast, but how can one use it with multiple application server Python processes??
taleinat
Excellent comment, S.Lott!
viksit
@taleinat: "but how can one use it with multiple application server Python processes?" Is this a new question? Please ask it as a new question. Is this a change? Please revise your question.
S.Lott
+1  A: 

Another option is to review the requirement for state. If the serialisation is the bottleneck, it sounds like the object is very large. Do you really need an object that large?

I know that in Stack Overflow podcast 27 the reddit guys discuss what they use for state, so that may be useful to listen to.

Andrew Cox
I listened to most of the podcast you mentioned. The reddit guys didn't really go into much detail, they mostly said that they use memcached to sync some things instead of "sticky sessions", but that sticky sessions are simpler; and still they would use memcached again because it worked out so well.
taleinat
+3  A: 

I think the multiprocessing package has something that might be applicable here, namely the multiprocessing.sharedctypes module.

Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
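A minimal sketch of what the shared ctypes approach looks like. The `UserState` structure and its fields are hypothetical. The example assumes a POSIX system where the `fork` start method is available, and it only covers processes started through multiprocessing. The state has to be laid out as a ctypes type, so arbitrary Python objects cannot be shared this way.

```python
import ctypes
import multiprocessing as mp


class UserState(ctypes.Structure):
    """Hypothetical fixed-layout state; field names are illustrative."""
    _fields_ = [("score", ctypes.c_double), ("steps", ctypes.c_int)]


def apply_step(state, delta):
    """Mutate the shared structure in place; the state is never pickled."""
    with state.get_lock():
        state.score += delta
        state.steps += 1


def run_demo(n_workers=4):
    # Assumption: "fork" exists on this platform (not the case on Windows).
    ctx = mp.get_context("fork")
    # ctx.Value allocates the structure in shared memory via
    # multiprocessing.sharedctypes and wraps it with a reentrant lock.
    state = ctx.Value(UserState, 0.0, 0)
    workers = [
        ctx.Process(target=apply_step, args=(state, 1.5))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return state.score, state.steps


if __name__ == "__main__":
    print(run_demo())
```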

Rafał Dowgird
Thanks for this! I hadn't noticed that part of the multiprocessing module's functionality. This could be just what I need!
taleinat
The shared memory functionality in multiprocessing requires shared objects to be instances of special classes, so it doesn't work for normal Python objects. However, this is the best suggestion so far and the closest to an answer to my question, so I am accepting it.
taleinat
+2  A: 

First of all, your approach is not common web development practice. Even when multithreading is used, web applications are designed to be able to run in multi-process environments, for both scalability and easier deployment.

If you just need to initialize a large object and do not need to change it later, you can do that easily with a global variable that is initialized when your WSGI application is created, or when the module containing the object is loaded. Multiprocessing will work fine for you in that case.
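A rough sketch of the read-only global approach for a bare WSGI application. `load_reference_data`, `classify`, and the data itself are hypothetical stand-ins for the rarely changing database tables. Each worker process builds its own copy at import time, which is fine as long as nothing mutates it.

```python
from urllib.parse import parse_qs


def load_reference_data():
    """Hypothetical loader; a real one would query the slow-changing tables."""
    return {
        "thresholds": (0.2, 0.5, 0.9),
        "labels": ("low", "medium", "high"),
    }


# Built once per worker process when the module is imported, then only read.
REFERENCE_DATA = load_reference_data()


def classify(value):
    """Use the preloaded data without touching the database."""
    pairs = zip(REFERENCE_DATA["thresholds"], REFERENCE_DATA["labels"])
    for threshold, label in pairs:
        if value <= threshold:
            return label
    return "out of range"


def application(environ, start_response):
    """WSGI entry point; reads ?v=<number> from the query string."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        value = float(params.get("v", ["0"])[0])
    except ValueError:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"bad value"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [classify(value).encode("utf-8")]
```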

If you need to change the object and access it from every thread, you need to make sure the object is thread-safe, using locks to ensure that, and run everything in a single server process. Any multithreaded Python server will serve you well; FCGI is also a good choice for this kind of design.

But if multiple threads are accessing and changing your object, the locks may hurt performance badly enough to make all the benefits go away.
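A minimal sketch of the single-process, lock-protected variant. The `SharedState` class and its methods are illustrative names, and a per-user counter stands in for the real algorithm state.

```python
import threading


class SharedState:
    """Mutable state shared by all request threads in one process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def record(self, user_id):
        # The lock makes the read-modify-write atomic across threads.
        with self._lock:
            self._counts[user_id] = self._counts.get(user_id, 0) + 1

    def snapshot(self):
        # Return a copy so callers never see the dict mid-update.
        with self._lock:
            return dict(self._counts)


def run_demo(n_threads=8, updates_per_thread=1000):
    state = SharedState()

    def worker():
        for _ in range(updates_per_thread):
            state.record("alice")

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state.snapshot()


if __name__ == "__main__":
    print(run_demo())
```

Every update takes the same lock, so under heavy write traffic the threads queue up behind each other, which is the contention cost mentioned above.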

M. Utku ALTINKAYA
A: 

This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access to and changes of persistent instances are managed through a cached Connection instance, which includes commit() and abort() methods so that changes are transactional.

http://www.mems-exchange.org/software/durus/

I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.

saffsd
This looks very interesting. However, Durus provides its own base class of which persisted objects must be instances, so it does not work for normal Python objects.
taleinat
You can't change to use subclasses of Persistent, etc even with multiple inheritance?
xitrium