ansaurus

Question

Very large input and piping using subprocess.Popen

Answer 1

+1 A:

Popen has a bufsize parameter that will limit the size of the buffer in memory. If you don't want the files in memory at all, you can pass file objects as the stdin and stdout parameters.

"bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered)." http://docs.python.org/library/subprocess.html
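As a rough sketch of the second suggestion (passing file objects so the data goes straight between the child process and the files on disk), something like the following could work in Python 3. The helper name `run_with_files` and its arguments are assumptions made for illustration, not part of the original post.

```python
import subprocess


def run_with_files(cmd, in_path, out_path):
    """Run cmd with its stdin and stdout bound directly to files on disk.

    Hypothetical helper: the child inherits the file descriptors, so the
    data is never read into the parent Python process.
    """
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        proc = subprocess.Popen(cmd, stdin=src, stdout=dst)
        return proc.wait()  # exit status of the child
```

This only helps when the parent does not need to inspect or modify the data; if each line has to be edited on the way through, a pipe is still required.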

2010-10-21 19:24:31

Straight from the same docs, under the `communicate` method: "Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited."

André Caron 2010-10-21 19:41:53

I posted the code above. This code definitely leads to the python process heading toward the stratosphere in terms of memory usage, so I am definitely missing some detail....

seandavi 2010-10-21 19:46:24

Answer 2

A:

However, all the data are buffered to memory ...

Are you using subprocess.Popen.communicate()? By design, this function will wait for the process to finish, all the while accumulating the data in a buffer, and then return it to you. As you've pointed out, this is problematic if dealing with very large files.

If you want to process the data while it is generated, you will need to write a loop using the poll() and .stdout.read() methods, then write that output to another socket/file/etc.

Be sure to heed the warnings in the documentation about this approach: it can easily deadlock (the parent process waits for the child process to generate data, while the child is in turn waiting for the parent to empty the pipe buffer).
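The loop described above might be sketched as follows in Python 3: read the child's stdout in fixed-size chunks and hand each chunk on, so memory use is bounded by the chunk size. The function name `stream_output`, the `sink` argument, and the 64 KiB default are assumptions for illustration. This version detects completion by end-of-file on the pipe rather than by calling poll().

```python
import subprocess


def stream_output(cmd, sink, chunk_size=64 * 1024):
    """Copy a child's stdout to sink chunk by chunk; return the exit status.

    Hypothetical helper: sink is any binary file-like object with write().
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    while True:
        # Blocks until chunk_size bytes are available or the pipe hits EOF.
        chunk = proc.stdout.read(chunk_size)
        if not chunk:  # b"" means the child closed its stdout
            break
        sink.write(chunk)
    proc.stdout.close()
    return proc.wait()
```

Because only stdout is piped here and the parent drains it continuously, the child never stalls on a full pipe. Piping stdin to the same child from the same thread can reintroduce the deadlock risk.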

André Caron 2010-10-21 19:39:27

Answer 3

+1 A:

Try this small change and see if the efficiency is better: iterate directly over the stdout stream.

    for line in samtoolsin.stdout:
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))

anijhaw 2010-10-21 19:48:06

That was the issue, anijhaw. Thanks for noticing.

seandavi 2010-10-21 19:54:32

Answer 4

A:

I was using the .read() method on the stdout stream, which reads the entire output into memory before returning. Instead, I simply needed to iterate over the stream directly, as in the for loop above. The corrected code does what I expected.

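#!/usr/bin/env python
import os
import sys
import subprocess

def main(infile, reflist):
    print infile, reflist
    # Reader: stream the alignments out of "samtools view".
    samtoolsin = subprocess.Popen(["samtools", "view", infile],
                                  stdout=subprocess.PIPE, bufsize=1)
    # Writer: feed the edited lines to "samtools import" through its stdin.
    samtoolsout = subprocess.Popen(["samtools", "import", reflist, "-",
                                    infile + ".tmp"],
                                   stdin=subprocess.PIPE, bufsize=1)
    # Iterating over the pipe yields one line at a time instead of
    # loading the whole output into memory the way .read() does.
    for line in samtoolsin.stdout:
        if line.startswith("@"):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if linesplit[10] == "*":
                linesplit[9] = "*"
            samtoolsout.stdin.write("\t".join(linesplit))
    # Send EOF to the writer, then wait for both children to exit.
    samtoolsout.stdin.close()
    samtoolsin.wait()
    samtoolsout.wait()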
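
For readers on Python 3, the same streaming pattern might look like the sketch below. The per-line rule is pulled into its own function so it can be checked without samtools installed. The names `fix_line` and `filter_stream` are assumptions for illustration, and `text=True` requires Python 3.7 or later.

```python
import subprocess


def fix_line(line):
    """Apply the rule from the loop above to one tab-separated line.

    Lines starting with "@" pass through unchanged. Otherwise, when the
    field at index 10 is "*", the field at index 9 is set to "*" as well.
    """
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    if fields[10] == "*":
        fields[9] = "*"
    return "\t".join(fields)


def filter_stream(reader_cmd, writer_cmd):
    """Pipe reader_cmd's output through fix_line into writer_cmd, line by line."""
    reader = subprocess.Popen(reader_cmd, stdout=subprocess.PIPE, text=True)
    writer = subprocess.Popen(writer_cmd, stdin=subprocess.PIPE, text=True)
    for line in reader.stdout:  # one line at a time, never the whole output
        writer.stdin.write(fix_line(line))
    writer.stdin.close()        # send EOF so the writer can finish
    reader.stdout.close()
    return reader.wait(), writer.wait()
```

Called with the two samtools command lists from the script above, `filter_stream` would return the exit codes of both processes.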

seandavi 2010-10-21 19:53:28
