I'm trying to use Python to run pdftotext, but for some reason my code isn't working. When I run the code below, I expect the content variable to contain the text of the PDF, but all I get is an empty string.

Does anybody know what I'm missing?

import subprocess

def getPDFContent(path):
    path = "/path/to/a valid/pdffile.pdf"

    process = subprocess.Popen(["pdftotext", path], shell=False,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    content, err = process.communicate()
    return content, err
+2  A: 

By default, pdftotext doesn't write anything to stdout; instead it creates a .txt file with the same base name as the PDF. To get the text on stdout, pass - as the second argument (the output file name) in the call to pdftotext:

process = subprocess.Popen(["pdftotext", path, "-"], shell=False, 
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
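For completeness, here is a minimal sketch of the whole function with that change applied. The names build_pdftotext_command and get_pdf_content and the executable parameter are my own, not part of pdftotext or the original code, and decoding the output as UTF-8 is an assumption about how your pdftotext build is configured. It keeps stderr separate from stdout so error messages don't end up mixed into the extracted text.

```python
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the argument list; the trailing "-" sends the text to stdout."""
    return [executable, pdf_path, "-"]


def get_pdf_content(pdf_path, executable="pdftotext"):
    """Run pdftotext on pdf_path and return the extracted text as a str.

    Raises subprocess.CalledProcessError if pdftotext exits with a
    non-zero status, and FileNotFoundError if the executable is missing.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path, executable),
        stdout=subprocess.PIPE,   # extracted text
        stderr=subprocess.PIPE,   # kept separate from the text
        check=True,
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

You would call it as get_pdf_content("/path/to/a valid/pdffile.pdf"). Because the arguments are passed as a list with the default shell=False, the space in the path needs no quoting.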
sth
Good god, you're right. Oye, I hate life sometimes.
mlissner