I'd like to parallelize my Python program so that it can make use of multiple processors on the machine that it runs on. My parallelization is very simple, in that all the parallel "threads" of the program are independent and write their output to separate files. I don't need the threads to exchange information but it is imperative that I know when the threads finish since some steps of my pipeline depend on their output.

Portability is important, in that I'd like this to run on any Python version on Mac, Linux and Windows. Given these constraints, which is the most appropriate Python module for implementing this? I am trying to decide between thread, subprocess and multiprocessing, which all seem to provide related functionality.

Any thoughts on this? I'd like the simplest solution that's portable.

Thanks.

+1  A: 

To use multiple processors in CPython your only choice is the multiprocessing module. CPython keeps a lock on its internals (the GIL), which prevents threads from running in parallel on other CPUs. The multiprocessing module creates new processes (like subprocess) and manages communication between them.
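As a rough illustration of the case in the question (independent tasks, one output file each, and a need to know when everything is done), here is a minimal sketch using multiprocessing.Pool. The worker square_to_file, the helper run_all and the result_N.txt file names are made up for the example; only Pool and its map(), close() and join() methods come from the module.

```python
import multiprocessing
import os
import tempfile


def square_to_file(task):
    """Hypothetical worker: compute something and write it to its own file."""
    number, out_dir = task
    path = os.path.join(out_dir, "result_%d.txt" % number)
    with open(path, "w") as handle:
        handle.write("%d\n" % (number * number))
    return path


def run_all(numbers, out_dir):
    """Run one task per number and block until every task is done."""
    tasks = [(n, out_dir) for n in numbers]
    pool = multiprocessing.Pool()  # by default, sized from the CPU count
    try:
        # map() returns only after all tasks have finished.
        return pool.map(square_to_file, tasks)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":  # needed where workers re-import this module (e.g. Windows)
    out_dir = tempfile.mkdtemp()
    for path in run_all(range(4), out_dir):
        print(path)
```

Since map() blocks until all tasks are done, the next pipeline step can simply follow the call to run_all().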

THC4k
That's not quite true. AFAIK you can release the GIL using the C API, and other implementations of Python, such as IronPython or Jython, don't suffer from such limitations. I didn't downvote though.
Bastien Léonard
+2  A: 

multiprocessing is a great Swiss-army-knife type of module. It is more general than threads, as you can even perform remote computations. This is therefore the module I would suggest you use.

The subprocess module would also allow you to launch multiple processes, but I found it to be less convenient to use than the new multiprocessing module.
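For comparison, a minimal sketch of the subprocess route: start every child at once with Popen, then wait() on each one. The helper launch_all and the one-line child programs are hypothetical; in a real pipeline each command would be your own script writing its own output file.

```python
import subprocess
import sys


def launch_all(commands):
    """Start every command at once, then wait for all of them to exit."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # wait() blocks until that child exits and returns its exit status.
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Hypothetical jobs: each child is a fresh interpreter running a one-liner.
    jobs = [[sys.executable, "-c", "print(%d * %d)" % (n, n)] for n in range(3)]
    print(launch_all(jobs))
```

This works, but you have to build the command lines and pass data in and out yourself, which is the part multiprocessing handles for you.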

Threads are notoriously subtle, and with CPython they are often limited to a single core (even though, as noted in one of the comments, the GIL can be released in C code called from Python code).
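To make the point concrete, here is a small sketch of the thread version. The task count_down and the helper run_threads are invented for the example. The code is correct and join() tells you when each thread is done, but assuming a standard CPython build with the GIL, the pure-Python loops will not run on several cores at once.

```python
import threading


def count_down(n, results, index):
    """Hypothetical CPU-bound task: a pure-Python loop that holds the GIL."""
    total = 0
    while n > 0:
        total += n
        n -= 1
    results[index] = total


def run_threads(sizes):
    """Run one thread per size and return once all of them have finished."""
    results = [None] * len(sizes)
    threads = [threading.Thread(target=count_down, args=(size, results, i))
               for i, size in enumerate(sizes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks until that thread has finished
    return results


if __name__ == "__main__":
    print(run_threads([1000000, 1000000]))
```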

I believe that most of the functions of the three modules you cite can be used in a platform-independent way. On the portability side, note that multiprocessing has been part of the standard library only since Python 2.6 (a backport for some older versions of Python does exist, though). But it's a great module!

EOL
+1  A: 

In a similar case I opted for separate processes, with the little bit of necessary communication going through a network socket. It is highly portable and quite simple to do using Python, but probably not the simplest option (in my case I also had another constraint: communication with other processes written in C++).
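A minimal sketch of that idea, not my actual code: the parent listens on a local TCP port, starts a child process, and reads whatever the child sends before it closes the connection. The names CHILD_CODE and run_child_and_wait, the b'done' message and the 30 second timeout are all assumptions made for the example.

```python
import socket
import subprocess
import sys

# Hypothetical child program: connect to the parent's port and report completion.
CHILD_CODE = (
    "import socket, sys\n"
    "s = socket.create_connection(('127.0.0.1', int(sys.argv[1])))\n"
    "s.sendall(b'done')\n"
    "s.close()\n"
)


def run_child_and_wait():
    """Start one child and return the bytes it sent over the socket."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(30)  # do not hang forever if the child never connects
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    child = subprocess.Popen([sys.executable, "-c", CHILD_CODE, str(port)])
    try:
        conn, _addr = server.accept()
        chunks = []
        while True:
            chunk = conn.recv(1024)
            if not chunk:  # empty read: the child closed its end
                break
            chunks.append(chunk)
        conn.close()
    finally:
        server.close()
        child.wait()
    return b"".join(chunks)


if __name__ == "__main__":
    print(run_child_and_wait())
```

The child here is Python, but anything that can open a TCP connection (a C++ program, for instance) could play the same role.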

In your case I would probably go for multiprocessing, as Python threads do not behave like real parallel threads (in CPython, the GIL lets only one of them run Python code at a time).

kriss