I am new to the subprocess module of Python; currently my implementation is not multi-processed.

import shlex
import subprocess

def forcedParsing(fname):
    cmd = 'strings "%s"' % fname
    #print cmd
    args = shlex.split(cmd)
    try:
        sp = subprocess.Popen(args, shell=False,
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = sp.communicate()
    except OSError, e:
        # bind the exception instance; the OSError class itself
        # has no useful errno/strerror attributes
        print "Error no %s  Message %s" % (e.errno, e.strerror)
        return None

    if sp.returncode == 0:
        #print "Processed %s" % fname
        return out

res = []
for f in file_list:
    res.append(forcedParsing(f))

my questions:

1) Is sp.communicate() a good way to go? Should I use poll() instead? If I use poll() I would need a separate process which monitors whether the process has finished, right?

2) Should I fork at the for loop?

I would appreciate any help!

+1  A: 

There are several warnings in the subprocess documentation that advise you to use communicate() to avoid problems with the process blocking, so it would be a good idea to use that.
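
To illustrate the failure mode those warnings describe (my example, using /bin/ls as an arbitrary input file): if you read one pipe manually while the child is still filling the other, both sides can block forever. communicate() drains both pipes and waits for the process to exit:

import subprocess

sp = subprocess.Popen(["strings", "/bin/ls"],
                      stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Don't do: out = sp.stdout.read(); err = sp.stderr.read()
# If the child blocks writing to a full stderr pipe, the stdout
# read() never returns and you have a deadlock.

# communicate() reads both pipes for you and waits for the process to exit:
out, err = sp.communicate()
print "exit status:", sp.returncode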

Mark Byers
Well, subprocess.communicate() waits, so forking will be blocked.
V3ss0n
+1  A: 

1) subprocess.communicate() seems the right option for what you are trying to do. And you don't need to poll the process: communicate() returns only when it has finished.

2) You mean forking to parallelize work? Take a look at multiprocessing (Python >= 2.6). Running parallel processes using subprocess is of course possible, but it's quite a bit of work: you cannot just call communicate(), which is blocking.
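
For instance (a sketch of mine, not from your question, assuming Python >= 2.6 and the forcedParsing() function defined above), a Pool takes care of the forking, and the blocking communicate() calls simply happen inside the worker processes:

from multiprocessing import Pool

def parseAll(file_list):
    # 4 workers is an arbitrary starting point; tune to your core count
    pool = Pool(processes=4)
    try:
        # each worker calls forcedParsing (and its blocking communicate())
        # in a separate process; map() returns results in input order
        return pool.map(forcedParsing, file_list)
    finally:
        pool.close()
        pool.join()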

About your code:

cmd = 'strings "%s"' % fname
args = shlex.split(cmd)

Why not simply:

args = ["strings", fname]

As for this ugly pattern:

res = []
for f in file_list:
    res.append(forcedParsing(f))

You should use list comprehensions whenever possible:

res = [forcedParsing(f) for f in file_list]
tokland
Good answer too. Yes, I want to use multiprocessing, but current Debian stable only has up to 2.5.x, which sucks; I may change to Gentoo/Sabayon later. Thanks a lot for correcting my syntax too. This is just example code; in the real code there are a few conditional statements inside, so list comprehensions will not be possible. If I fork at the loop, subprocess.communicate() will just block, right? That's bad news. So use poll() instead? I just need the output when the program exits, not all the time.
V3ss0n
You could try the multiprocessing backport: http://code.google.com/p/python-multiprocessing/.
tokland
+1  A: 

About question 2: forking at the for loop will mostly speed things up if the script is supposed to run on a system with multiple cores/processors. It will consume more memory, though, and will stress IO harder. There will be a sweet spot somewhere that depends on the number of files in file_list, but only benchmarking on a realistic target system can tell you where it is. If you find that number, you could add an if len(file_list) > <your number>: that optionally forks [Edit: rather, as @tokland says, via multiprocessing if it's available on your Python version (2.6+)], so the script chooses the most efficient strategy on a per-job basis. A sketch of one way to cap the number of concurrent subprocesses follows below.
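
Here is a rough sketch (mine, not a tested implementation; MAX_JOBS and parseAllLimited are hypothetical names) of capping concurrency with Popen plus poll(), as asked about in the comments. Each child's stdout goes to a temp file rather than a pipe, so a child producing lots of output can never deadlock on a full pipe buffer while we wait:

import subprocess
import tempfile
import time

MAX_JOBS = 10   # hypothetical cap; benchmark to find your sweet spot

def parseAllLimited(file_list):
    results = {}
    running = {}               # fname -> (Popen object, temp file for stdout)
    pending = list(file_list)
    while pending or running:
        # start new jobs while we are under the cap
        while pending and len(running) < MAX_JOBS:
            fname = pending.pop()
            outfile = tempfile.TemporaryFile()
            sp = subprocess.Popen(["strings", fname], stdout=outfile)
            running[fname] = (sp, outfile)
        # reap whichever jobs have finished; poll() does not block
        for fname, (sp, outfile) in running.items():
            if sp.poll() is not None:
                if sp.returncode == 0:
                    outfile.seek(0)
                    results[fname] = outfile.read()
                outfile.close()
                del running[fname]
        time.sleep(0.1)        # don't busy-wait between checks
    return results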

Read about Python profiling here: http://docs.python.org/library/profile.html

If you're on Linux, you can also run time: http://linuxmanpages.com/man1/time.1.php

Jacob Oscarson
Good one. I can limit the number of forks, of course. Yes, I am on Linux, and the file list can go to 10k+, so, say, 10 forks at the same time should be OK (the production server will have 8 cores with up to 16 GB of DDR3 RAM).
V3ss0n