How do I execute the following shell command using the Python subprocess module?

echo "input data" | awk -f script.awk | sort > outfile.txt

The input data will come from a string, so I don't actually need echo. I've got this far, can anyone explain how I get it to pipe through sort too?

import subprocess

p_awk = subprocess.Popen(["awk", "-f", "script.awk"],
                         stdin=subprocess.PIPE,
                         stdout=open("outfile.txt", "w"))
p_awk.communicate(b"input data")

UPDATE: Note that while the accepted answer below doesn't actually answer the question as asked, I believe S.Lott is right and it's better to avoid having to solve that problem in the first place!

+1  A: 

http://www.python.org/doc/2.5.2/lib/node535.html covered this pretty well. Is there some part of this you didn't understand?

Your program would be pretty similar, but the second Popen would have stdout= set to a file, and you wouldn't need the output of its .communicate().

geocar
What I don't understand (given the documentation's example) is if I say p2.communicate("input data"), does that actually get sent to p1.stdin?
Tom
You wouldn't. p1's stdin arg would be set to PIPE and you'd write p1.communicate('foo') then pick up the results by doing p2.stdout.read()
geocar
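
For reference, the two-Popen pattern discussed in these comments could be sketched roughly as below. This is only a sketch, assuming Python 3 on a POSIX system; the helper name pipe_two and its parameters are made up for illustration and are not part of subprocess. The input is written to the first process's stdin, and the second process reads the first one's stdout directly.

```python
import subprocess


def pipe_two(first_cmd, second_cmd, data, out_path):
    """Run `first_cmd | second_cmd > out_path`, feeding `data` (bytes) to first_cmd."""
    with open(out_path, "wb") as outfile:
        p1 = subprocess.Popen(first_cmd,
                              stdin=subprocess.PIPE,
                              stdout=subprocess.PIPE)
        p2 = subprocess.Popen(second_cmd,
                              stdin=p1.stdout,
                              stdout=outfile)
        # Close the parent's copy of the pipe so only p2 holds the read end.
        p1.stdout.close()
        # The input goes to the FIRST process, not to p2.
        p1.stdin.write(data)
        p1.stdin.close()
        # Wait for both stages and return their exit codes.
        return p1.wait(), p2.wait()


# Hypothetical usage, assuming awk, sort and script.awk are available:
# pipe_two(["awk", "-f", "script.awk"], ["sort"], b"input data\n", "outfile.txt")
```

Writing straight to p1.stdin is fine for small inputs like the one in the question; very large inputs may need more care.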
+5  A: 

You'd be a little happier with the following.

import subprocess

awk_sort = subprocess.Popen(["-c", "awk -f script.awk | sort > outfile.txt"],
    stdin=subprocess.PIPE, shell=True)
# communicate() expects bytes unless the pipe was opened in text mode.
awk_sort.communicate(b"input data\n")
awk_sort.wait()

Delegate part of the work to the shell. Let it connect two processes with a pipeline.
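
A variation on the same idea, as a sketch only: with shell=True the whole pipeline can also be passed as one string, which is the form the subprocess documentation describes for shell commands. The helper name run_shell_pipeline is hypothetical, and the sketch assumes Python 3 and a POSIX shell. The string is handed to the shell unescaped, so it should only ever be built from trusted text.

```python
import subprocess


def run_shell_pipeline(pipeline, data):
    """Feed `data` (bytes) to a shell pipeline and return the shell's exit status."""
    # With shell=True and a single string, Popen runs: /bin/sh -c "<pipeline>"
    proc = subprocess.Popen(pipeline, stdin=subprocess.PIPE, shell=True)
    # communicate() writes the data, closes stdin and waits for the shell to exit.
    proc.communicate(data)
    return proc.returncode


# Hypothetical usage, assuming awk, sort and script.awk are available:
# run_shell_pipeline("awk -f script.awk | sort > outfile.txt", b"input data\n")
```

The returned status is whatever the shell reports, which for a pipeline is normally the status of its last command.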

You'd be a lot happier rewriting 'script.awk' into Python, eliminating awk and the pipeline.
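
What that might look like, very roughly: the contents of script.awk aren't shown in the question, so the transform() function below is a purely hypothetical stand-in (it keeps the second whitespace-separated field of each line). Only the overall shape, transform then sort then write, is the point.

```python
def transform(line):
    """Hypothetical stand-in for script.awk: keep the second field of the line."""
    fields = line.split()
    return fields[1] if len(fields) > 1 else ""


def process(data, out_path):
    """Transform each input line, sort the results and write them to out_path."""
    results = sorted(transform(line) for line in data.splitlines())
    with open(out_path, "w") as outfile:
        for item in results:
            outfile.write(item + "\n")
    return results


# Hypothetical usage:
# process("input data\n", "outfile.txt")
```

Note that sorted() orders strings by code point, which may differ from what sort(1) does under some locales.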

Edit. Some of the reasons for suggesting that awk isn't helping.

[There are too many reasons to respond via comments.]

  1. Awk is adding a step of no significant value. There's nothing unique about awk's processing that Python doesn't handle.

  2. The pipelining from awk to sort, for large sets of data, may improve elapsed processing time. For short sets of data, it has no significant benefit. A quick measurement of awk >file ; sort file and awk | sort will reveal whether concurrency helps. With sort, it rarely helps because sort is not a once-through filter.

  3. The simplicity of "Python to sort" processing (instead of "Python to awk to sort") prevents the exact kind of questions being asked here.

  4. Python -- while wordier than awk -- is also explicit where awk has certain implicit rules that are opaque to newbies, and confusing to non-specialists.

  5. Awk (like the shell script itself) adds Yet Another Programming language. If all of this can be done in one language (Python), eliminating the shell and the awk programming eliminates two programming languages, allowing someone to focus on the value-producing parts of the task.

Bottom line: awk can't add significant value. In this case, awk is a net cost; it added enough complexity that it was necessary to ask this question. Removing awk will be a net gain.

Sidebar: Why building a pipeline (a | b) is so hard.

When the shell is confronted with a | b it has to do the following.

  1. Fork a child process of the original shell. This will eventually become b.

  2. Build an OS pipe. This is not a Python subprocess.PIPE but a call to os.pipe(), which returns two new file descriptors that are connected via a common buffer. At this point the process has stdin, stdout and stderr from its parent, plus the two descriptors that will become "a's stdout" and "b's stdin".

  3. Fork a child. The child replaces its stdout with the new a's stdout. Exec the a process.

  4. The b child replaces its stdin with the new b's stdin. Exec the b process.

  5. The b child waits for a to complete.

  6. The parent is waiting for b to complete.

I think that the above can be used recursively to spawn a | b | c, but you have to implicitly parenthesize long pipelines, treating them as if they're a | (b | c).

Since Python has os.pipe(), the os.exec*() family and os.fork(), and you can redirect stdin and stdout (for example with os.dup2()), there's a way to do the above in pure Python. Indeed, you may be able to work out some shortcuts using os.pipe() and subprocess.Popen.
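
A minimal sketch of that pure-Python route, assuming Python 3 on a POSIX system. It is a simplified variant of the steps above: the parent forks both children itself instead of having the b child fork a. The name run_pipe is hypothetical; os.pipe(), os.fork(), os.dup2(), os.execvp() and os.waitpid() are the real calls doing the work.

```python
import os


def run_pipe(cmd_a, cmd_b):
    """Run `cmd_a | cmd_b` and return the raw wait statuses of both children."""
    read_fd, write_fd = os.pipe()

    pid_a = os.fork()
    if pid_a == 0:
        # Child a: its stdout becomes the write end of the pipe.
        os.dup2(write_fd, 1)
        os.close(read_fd)
        os.close(write_fd)
        try:
            os.execvp(cmd_a[0], cmd_a)
        finally:
            os._exit(127)  # only reached if the exec failed

    pid_b = os.fork()
    if pid_b == 0:
        # Child b: its stdin becomes the read end of the pipe.
        os.dup2(read_fd, 0)
        os.close(read_fd)
        os.close(write_fd)
        try:
            os.execvp(cmd_b[0], cmd_b)
        finally:
            os._exit(127)  # only reached if the exec failed

    # Parent: close both ends, otherwise b never sees end-of-file.
    os.close(read_fd)
    os.close(write_fd)
    _, status_a = os.waitpid(pid_a, 0)
    _, status_b = os.waitpid(pid_b, 0)
    return status_a, status_b


# Hypothetical usage, assuming awk, sort and script.awk are available:
# run_pipe(["awk", "-f", "script.awk"], ["sort"])
```

A status of 0 means the child exited successfully. Both children inherit the parent's remaining descriptors, so b's output goes wherever the parent's stdout points.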

However, it's easier to delegate that operation to the shell.

S.Lott
That's pretty evil!
Ali A
Can you explain what the "-c" does?
Tom
And I think Awk is actually a good fit for what I am doing, the code is shorter and simpler than the equivalent Python code (it's a domain specific language after all.)
Tom
-c tells the shell (the actual application you're starting) that the following argument is a command to run. In this case, the command is a shell pipeline.
S.Lott
"the code is shorter" does not -- actually -- mean simpler. It only means shorter. Awk has a lot of assumptions and hidden features that make the code very hard to work with. Python, while longer, is explicit.
S.Lott
Ok, thanks, my code is working now, but I'm not going to mark it as the accepted answer yet as I think there must be a way to do this without punting to the shell and having to deal with the escaping issues etc. that that raises. And I replaced "> outfile.txt" with stdout=open("outfile.txt","w").
Tom
And, that doesn't change the original question, which is how to use subprocess.Popen. Awk and sort are only used for illustration as potential answerers are likely to have them to test with.
Tom
The original question (how to assemble a shell pipeline with Popen) is something that's (a) complex and (b) never necessary. Using the shell or eliminating the complexity are better approaches.
S.Lott
Ok, you might have convinced me :) I am still going to keep using Awk for now as rewriting in Python isn't feasible in the short term; however, I can see it's simpler/more reasonable to let the shell handle the pipelining.
Tom