views: 189

answers: 1
I have written a Python script that watches a directory for new subdirectories and then acts on each subdirectory in a loop. An external process creates these subdirectories. Inside each subdirectory is a text file and a number of images, with one record (line) in the text file for each image. For each subdirectory my script scans the text file, then calls a few external programs: one detects blank images (a custom exe), then a call to "mogrify" (part of ImageMagick) resizes and converts the images, and finally a call to 7-Zip packages all of the converted images and the text file into a single archive.

The script runs fine, but it is currently sequential, looping over each subdirectory one at a time. This seems like a good chance to do some multiprocessing, since it runs on a dual-CPU machine (8 cores total).

The processing of a given subdirectory is independent of all the others; they are self-contained.

Currently I am just creating a list of subdirectories with a call to os.listdir() and then looping over that list. I figure I could move all of the per-subdirectory code (conversions, etc.) into a separate function and then somehow create a separate process to handle each subdirectory. Since I am somewhat new to Python, suggestions on how to approach this kind of multiprocessing would be appreciated. I am on Vista x64 running Python 2.6.

A: 

I agree that this design sounds like it could benefit from concurrency. Take a look at the multiprocessing module. You may also want to look at the threading module and compare speeds. It's difficult to say exactly how many cores are needed before multiprocessing beats threading, and eight cores is well within the range where threading might be faster (yes, despite the GIL; most of the heavy lifting here happens in external programs, so your Python threads spend their time waiting rather than holding the GIL).
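One quick way to run that comparison, assuming you first factor your per-subdirectory work into a single function (here a placeholder called process_subdir, with C:\incoming standing in for your watched directory), is to push the same worker through a process pool and a thread pool; multiprocessing.dummy provides a thread-backed pool with the same interface, so the two runs differ only in the pool class:

    import os
    import time
    from multiprocessing import Pool                       # process-based pool
    from multiprocessing.dummy import Pool as ThreadPool   # same API, backed by threads

    def process_subdir(path):
        # placeholder for the real per-subdirectory work: scan the text file,
        # run the blank-image detector, mogrify the images, 7-zip the results
        pass

    def timed_run(pool_class, subdirs, workers=8):
        pool = pool_class(workers)
        start = time.time()
        pool.map(process_subdir, subdirs)
        pool.close()
        pool.join()
        return time.time() - start

    if __name__ == '__main__':          # this guard is required on Windows
        base = r'C:\incoming'           # stand-in for your watched directory
        subdirs = [os.path.join(base, d) for d in os.listdir(base)
                   if os.path.isdir(os.path.join(base, d))]
        print 'processes:', timed_run(Pool, subdirs)
        print 'threads:  ', timed_run(ThreadPool, subdirs)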

From a design perspective, my biggest recommendation is to avoid interaction between processes entirely if possible. Have one central thread look for the event that triggers process creation (I'm guessing it's a subdirectory creation?) and then spawn a process to handle the subdirectory. From there on out, the spawned process should not interact with any other processes, ever. From your description it seems like this should be possible.
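As a rough, untested sketch of that design (process_subdir, the poll interval, and C:\incoming are placeholders for your own code and paths), the central loop could simply poll for new subdirectories and hand each one off to a pool of worker processes sized to the core count:

    import os
    import time
    from multiprocessing import Pool

    def process_subdir(path):
        # self-contained worker: read the text file, run the blank-image exe,
        # call mogrify and 7-zip via subprocess, and touch nothing outside `path`
        pass

    def watch(base, poll_seconds=5):
        pool = Pool(processes=8)        # one worker per core
        seen = set()
        while True:
            for name in os.listdir(base):
                full = os.path.join(base, name)
                if os.path.isdir(full) and full not in seen:
                    seen.add(full)
                    # hand the new subdirectory to a worker process; the
                    # central loop never interacts with it again
                    pool.apply_async(process_subdir, (full,))
            time.sleep(poll_seconds)

    if __name__ == '__main__':
        watch(r'C:\incoming')           # stand-in for your watched directory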

Lastly, I'd like to add a word of encouragement for moving to Python 3.0. There is a lot of talk of staying with 2.x, but 3.0 does make some real improvements, and as more and more people move to Python 3.0, it is going to get harder to find tools and support for 2.x.

Imagist
Thanks for the recommendations. The processes are independent. The only other snag is logging. I am writing a log through a FileHandler to report errors in the processing. It looks like multiprocessing logging gets complex! The only issue I see with moving to 3.0 is that I need pyodbc for connecting to MS SQL Server, and that module only supports up to 2.6 right now.
Bryan Lewis
With respect to logging, a very basic solution could be to do separate logging for each process and then merge the logs together. An improvement could be to use a database; since you mention SQL Server, that would be rather natural (for a quick test of the idea, SQLite comes to mind).
Francesco
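A rough sketch of the SQLite idea from the comment above (untested; SQLiteHandler and processing_log.db are made-up names, and the stdlib sqlite3 module has been available since Python 2.5): each worker attaches its own handler and writes records to one shared database file, with SQLite serializing the writes, which should be fine for modest log volume:

    import logging
    import sqlite3

    class SQLiteHandler(logging.Handler):
        """Minimal logging handler: one row per log record in a SQLite file."""

        def __init__(self, db_path):
            logging.Handler.__init__(self)
            self.db_path = db_path
            conn = sqlite3.connect(self.db_path)
            conn.execute('CREATE TABLE IF NOT EXISTS log '
                         '(created REAL, name TEXT, level TEXT, message TEXT)')
            conn.commit()
            conn.close()

        def emit(self, record):
            # open a fresh connection per record: simple and safe across
            # processes, if not the fastest option
            conn = sqlite3.connect(self.db_path)
            conn.execute('INSERT INTO log VALUES (?, ?, ?, ?)',
                         (record.created, record.name,
                          record.levelname, record.getMessage()))
            conn.commit()
            conn.close()

    # each worker would attach its own handler instance, e.g.:
    # logger = logging.getLogger('converter')
    # logger.addHandler(SQLiteHandler('processing_log.db'))
    # logger.setLevel(logging.INFO)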
+1 on the suggestion of using SQLite for logging.
Imagist
Yes, using a DB for logging definitely makes sense and eliminates issues of file corruption. I haven't tried using a DB logging handler; I'll look at the docs. Thanks.
Bryan Lewis
I just wanted to quickly revisit this. I changed the script over to use threading and it really helped a lot. My test set on the 8-core machine took 1 minute under my old (single-threaded) script. Adding threading dropped that same set to 20 seconds! Not bad.
Bryan Lewis
Bryan, glad to hear it worked for you!
Imagist