Hi,

I was wondering if there is a way to run a command-line executable from Python and pass it data from memory, without having to write that data to a temporary file on disk. From what I have seen, subprocess.Popen(args) seems to be the preferred way to run programs from inside Python scripts.

For example, I have a PDF file in memory. I want to convert it to text using the command-line tool pdftotext, which is present in most Linux distros. But I would prefer not to write the in-memory PDF to a temporary file on disk.

pdfInMemory = myPdfReader.read()
convertedText = subprocess.<method>(['pdftotext', ??])  # <- what is the value of ??

Which method should I call, and how do I pipe in-memory data into its input and pipe its output back into another variable in memory?

I am guessing there are other PDF modules that can do the conversion in memory, and information about those modules would be helpful. But for future reference, I am also interested in how to pipe input and output to a command-line program from inside Python.

Any help would be much appreciated.

+1  A: 

Popen.communicate from subprocess takes an input parameter that is used to send data to the process's stdin; you can use that to pass in your data. For this to work, the process has to be created with stdin=subprocess.PIPE. communicate also returns the output of your program (if stdout=subprocess.PIPE was given), so you don't have to write it to a file.

The documentation for communicate explicitly warns that everything is buffered in memory, which seems to be exactly what you want to achieve.
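
As a minimal sketch of that idea (not a definitive implementation), the snippet below pipes bytes from memory into a child process and reads the result back into a variable. The helper name pipe_through and the UPPERCASE_FILTER command are made up for illustration; the child is just a second Python interpreter standing in for a real tool such as pdftotext.

```python
import subprocess
import sys


def pipe_through(argv, data):
    """Send `data` (bytes) to the command's stdin and return its stdout (bytes)."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,   # needed so communicate() can send `data`
        stdout=subprocess.PIPE,  # needed so communicate() can return the output
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate(data)
    if proc.returncode != 0:
        raise RuntimeError("command failed: %r" % err)
    return out


# Hypothetical stand-in for an external tool: reads stdin, writes it uppercased.
UPPERCASE_FILTER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]

result = pipe_through(UPPERCASE_FILTER, b"data held in memory")
```

Both the input and the output live entirely in memory here, so this only suits data that comfortably fits in RAM.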

Fabian
+1  A: 

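With Popen.communicate:

import subprocess
# stdin=subprocess.PIPE is required, otherwise communicate() cannot send pdf_data to the process
proc = subprocess.Popen(["pdftotext", "-", "-"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, err = proc.communicate(pdf_data)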
tokland
Another problem, not directly related, is converting the in-memory variable into a "seekable stream". Right now I am getting an error saying "Error: Document base stream is not seekable". I presume there is some method/module I can pass pdf_data to in order to make it a seekable stream?
Chaitanya
@Chaitanya: This is a pdftotext regression bug that has already been fixed: http://bugs.freedesktop.org/show_bug.cgi?id=7334. Update your poppler package. BTW, you cannot build a "seekable stream" in this scenario: the output is being written to a process file descriptor, so there is nothing Python can do there.
tokland
So it's complaining about the output from pdftotext not being seekable then, not the input data file. Fedora 12's software updater doesn't seem to update it, and neither does running su -c 'yum update poppler'. I've downloaded and unzipped version 0.14 from http://poppler.freedesktop.org/ but can't seem to install it (make and make install fail).
Chaitanya
Sorry to keep coming back to this, but how do I pass optional parameters using Popen? I am using a temporary file as suggested by greggo below. I want to preserve the layout by running "pdftotext -layout". I tried replacing the "-" in Popen with "-layout", replacing "pdftotext" with "pdftotext -layout", passing it into communicate, etc. None of it works. I just get empty text back.
Chaitanya
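On the optional-parameter question: with a list of arguments, Popen passes each list element to the program as one separate argument, so an option like -layout needs to be its own element, placed before the two "-" placeholders rather than replacing one of them. A string such as "pdftotext -layout" in the first position would be looked up as the program name itself. The sketch below assumes pdftotext is on the PATH and accepts "-" for stdin and stdout as in the answers above; the helper names pdftotext_argv and pdf_to_text are hypothetical.

```python
import subprocess


def pdftotext_argv(layout=False):
    """Build the argument list; every option is its own list element."""
    argv = ["pdftotext"]
    if layout:
        argv.append("-layout")  # option goes before the input/output names
    argv += ["-", "-"]          # "-" = read PDF from stdin, write text to stdout
    return argv


def pdf_to_text(pdf_file, layout=False):
    """Run pdftotext on an open binary file object positioned at its start.

    Assumption: pdftotext is installed; this function is not exercised here.
    """
    proc = subprocess.Popen(
        pdftotext_argv(layout), stdin=pdf_file, stdout=subprocess.PIPE
    )
    out, _ = proc.communicate()
    return out
```

With layout=True the command that runs is equivalent to typing pdftotext -layout - - in a shell.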
+1  A: 

os.tmpfile is useful if you need a seekable thing. It uses a file, but it's nearly as simple as the pipe approach, and no cleanup is needed.

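tf = os.tmpfile()  # anonymous temporary file; removed automatically once closed
tf.write(...)
tf.seek(0)  # rewind so the child process reads from the beginning
subprocess.Popen(  ...    , stdin=tf)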

This may not work on the POSIX-impaired OS 'Windows'.
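
Note that os.tmpfile exists only in Python 2; in Python 3 the closest standard-library equivalent is tempfile.TemporaryFile. A minimal sketch of the same approach with that function follows. The helper name run_with_tempfile_stdin and the REVERSE_FILTER command are made up for illustration, with a second Python interpreter standing in for pdftotext.

```python
import subprocess
import sys
import tempfile


def run_with_tempfile_stdin(argv, data):
    """Give the command a seekable temporary file as stdin; return its stdout."""
    with tempfile.TemporaryFile() as tf:
        tf.write(data)
        tf.seek(0)  # flushes the write buffer and rewinds to the start
        result = subprocess.run(
            argv, stdin=tf, stdout=subprocess.PIPE, check=True
        )
    return result.stdout


# Hypothetical stand-in for an external tool: writes its stdin back reversed.
REVERSE_FILTER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read()[::-1])",
]

reversed_bytes = run_with_tempfile_stdin(REVERSE_FILTER, b"abc")
```

The temporary file is deleted as soon as the with block exits, so there is still nothing to clean up by hand.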

greggo
This works too. To expand for future users, my version of the 4th line above is: "out, err = subprocess.Popen(["pdftotext", "-", "-"], stdin=tf, stdout=subprocess.PIPE).communicate()". After this, the variable 'out' contains the PDF in text format.
Chaitanya