views: 250
answers: 3

We're considering refactoring a large application with a complex GUI, which is isolated from the back-end in a decoupled fashion, to use the new (Python 2.6) multiprocessing module. The GUI/back-end interface uses Queues carrying Message objects in both directions.

One thing I've just concluded (tentatively, but feel free to confirm it) is that "object identity" would not be preserved across the multiprocessing interface. Currently when our GUI publishes a Message to the back-end, it expects to get the same Message back with a result attached as an attribute. It uses object identity (if received_msg is message_i_sent:) to identify returning messages in some cases... and that seems likely not to work with multiprocessing.
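A pickle round-trip illustrates the concern (a sketch with a hypothetical Message class; multiprocessing uses pickling to move objects across a Queue, so the receiver always gets a reconstructed copy):

```python
import pickle

class Message(object):
    def __init__(self, payload):
        self.payload = payload

sent = Message("do-something")

# multiprocessing pickles objects as they cross a Queue; a pickle
# round-trip simulates that process border without spawning a process.
received = pickle.loads(pickle.dumps(sent))

assert received is not sent              # object identity is NOT preserved
assert received.payload == sent.payload  # but the state is
```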

This question is to ask what "gotchas" like this you have seen in actual use or can imagine one would encounter in naively using the multiprocessing module, especially in refactoring an existing single-process application. Please specify whether your answer is based on actual experience. Bonus points for providing a usable workaround for the problem.

Edit: Although my intent with this question was to gather descriptions of problems in general, I think I made two mistakes: I made it community wiki from the start (which probably makes many people ignore it, as they won't get reputation points), and I included a too-specific example which -- while I appreciate the answers -- probably made many people miss the request for general responses. I'll probably re-word and re-ask this in a new question. For now I'm accepting one answer as best merely to close the question as far as it pertains to the specific example I included. Thanks to those who did answer!

+2  A: 

I have not used multiprocessing itself, but the problems presented are similar to experience I've had in two other domains: distributed systems, and object databases. Python object identity can be a blessing and a curse!

As for general gotchas, it helps if the application you are refactoring can acknowledge that tasks are being handled asynchronously. If not, you will generally end up managing locks, and much of the performance you could have gained by using separate processes will be lost to waiting on those locks. I will also suggest that you spend the time to build some scaffolding for debugging across processes. Truly asynchronous processes tend to be doing much more than the mind can hold and verify -- or at least my mind!

For the specific case outlined, I would manage object identity at the process border as items are queued and returned. When sending a task to be processed, annotate the task with its id(), and stash the task instance in a dictionary keyed by that id(). When the task is updated/completed, retrieve the exact original task from the dictionary by its id(), and apply the newly updated state to it. Now the exact task, and therefore its identity, will be maintained.
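A minimal sketch of that registry idea (hypothetical names; only the key and the task's state would actually cross the multiprocessing.Queue):

```python
class TaskRegistry(object):
    """Keeps the original task objects on the sending side, keyed by id()."""

    def __init__(self):
        self._pending = {}

    def register(self, task):
        key = id(task)             # unique while the task is alive and stored
        self._pending[key] = task  # holding a reference keeps the id() valid
        return key

    def resolve(self, key, result):
        task = self._pending.pop(key)  # retrieve the exact original instance
        task.result = result           # apply the newly arrived state to it
        return task

class Task(object):
    pass

registry = TaskRegistry()
t = Task()
key = registry.register(t)
# ... key plus the task's data would go through the Queue to the worker,
# and the worker would send (key, result) back ...
same = registry.resolve(key, 42)

assert same is t           # identity preserved on the sending side
assert same.result == 42
```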

Shane Holloway
@Shane, not bad (for the id() idea). As we have relatively few GUI-backend messages (compared to the in-GUI or in-backend message volume) the state-copying shouldn't pose a huge burden. There is the chance of memory leaks if a tagged-and-stored message is never returned, but that should be an exceptional condition resulting from a bug, not a regular occurrence.
Peter Hansen
@Peter, I've dealt with that problem before by adding a timeout value to the message entry -- either user-supplied or implicit. If the timeout expires before the reply arrives, dispatch an error handler. You will also then have to handle late replies. As for state copying, you could remove any data no longer needed once the message is sent, and still maintain the identity property on the sending side. There are other alternatives if your data is not Python objects: use mmap or memcached for memory-oriented data, or use references to external URLs, files and databases.
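A sketch of the timeout variant (hypothetical names; expire() would be called periodically, e.g. from a GUI timer):

```python
import time

class PendingEntry(object):
    def __init__(self, task, timeout):
        self.task = task
        self.deadline = time.time() + timeout

class TimeoutRegistry(object):
    """id()-keyed registry whose entries expire if no reply arrives in time."""

    def __init__(self, default_timeout=5.0):
        self._pending = {}
        self._default = default_timeout

    def register(self, task, timeout=None):
        key = id(task)
        self._pending[key] = PendingEntry(task, timeout or self._default)
        return key

    def resolve(self, key, result):
        entry = self._pending.pop(key, None)
        if entry is None:
            return None            # late reply: the entry already expired
        entry.task.result = result
        return entry.task

    def expire(self, now=None):
        """Drop overdue entries and return their tasks for error handling."""
        now = time.time() if now is None else now
        overdue = [k for k, e in self._pending.items() if e.deadline <= now]
        return [self._pending.pop(k).task for k in overdue]
```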
Shane Holloway
+1  A: 

Well, identity tests are really only appropriate for singletons (e.g. "a is None" or "a is False"); for ordinary objects such as your messages they're not good practice. A quick workaround would be to exchange the "is" test for "==" and use an incremental counter to define identity:

# Note: the id generator below is not threadsafe; guard it with a lock
# if Messages are created from more than one thread.
class Message(object):
    def _next_id():
        i = 0
        while True:
            i += 1
            yield i
    _idgen = _next_id()  # class-level generator shared by all instances
    del _next_id         # remove the helper so it never becomes a method

    def __init__(self):
        self.id = self._idgen.next()  # Python 2 generator protocol

    def __eq__(self, other):
        return (self.__class__ == other.__class__) and (self.id == other.id)

    def __ne__(self, other):
        # Python 2 does not derive __ne__ from __eq__, so define it too.
        return not self.__eq__(other)

This might be an idea.
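Since the generator above isn't thread-safe, a locked variant could look like this (a sketch written for modern Python; on 2.6 the same pattern applies with self._idgen.next() inside the lock):

```python
import itertools
import threading

class Message(object):
    # itertools.count plus an explicit lock gives a thread-safe id source.
    _idgen = itertools.count(1)
    _idlock = threading.Lock()

    def __init__(self):
        with self._idlock:
            self.id = next(self._idgen)

    def __eq__(self, other):
        return self.__class__ == other.__class__ and self.id == other.id

    def __ne__(self, other):
        return not self.__eq__(other)
```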

Also, be aware that if you have tons of "worker processes", memory consumption might be far greater than with a thread-based approach.

Alan Franzoni
Thanks Alan. What you show is like what I referred to in my comment to J.F.Sebastian with the exception that we would need a thread lock around the incrementing operation. As for worker processes, in our case the idea is to split only the GUI to ensure good user responsiveness and to minimize the effect on back-end operations from user activity in the GUI.
Peter Hansen
Alan Franzoni
The back-end is actually both multithreaded and asynchronous, with multiple Reactors. It's "only slightly twisted"... uses a package we call "bent". It can't effectively be made purely async (one thread), so we're stuck with locking for such things. Good thought though.
Peter Hansen
A: 

You can try the persistent package from my project GarlicSim. It's LGPL'ed.

http://github.com/cool-RR/GarlicSim/tree/development/garlicsim/garlicsim/misc/persistent/

(The main module in it is persistent.py)

I often use it like this:

# ...
self.identity = Persistent()

Then I have an identity that is preserved across processes.

cool-RR