ansaurus

Question

Converting a pdf to text/html in python so I can parse it

Answer 1

+1 A:

Have you checked out PDFMiner?

Cal Jacobson 2010-09-03 16:46:17

I have had a look at it, but I have a hard time figuring out how to implement it in my python script. Any sample code would be greatly appreciated :)

Thomas Jensen 2010-09-03 18:13:56

There's http://www.unixuser.org/~euske/python/pdfminer/programming.html - that doesn't seem hard to discover...

loevborg 2010-09-03 18:23:11

I have read this, but i am still new to python, so I dont know from the examples shown at the website how to convert a pdf to html/text (I have also read the examples at http://denis.papathanasiou.org/?p=343, which left me just as confused).

Thomas Jensen 2010-09-03 18:27:14

Try the command-line version first, which should be straightforward. You can later replace it with a python library call, which may have advantages (faster, less overhead), but should otherwise be pretty much the same.

loevborg 2010-09-03 18:32:52

Also [this sample](http://nullege.com/codes/show/src%40pdfminer-20100424%40tools%40pdf2txt.py/3/pdfminer.pdfparser.PDFParser) might be helpful. (Meta-advice: try nullege.com or Google Code Search for examples of library use)

loevborg 2010-09-03 18:44:40

Answer 2

+1 A:

It's not exactly magic. I suggest

downloading the PDF file to a temp directory,
calling out to an external program to extract the text into a (temp) text file,
reading the text file.

For text extraction command-line utilities you have a number of possibilities and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().

loevborg 2010-09-03 18:29:24

Thanks for the answer. In the end i chose to just use the adobe online conversion tool (see the code above).

Thomas Jensen 2010-09-07 09:48:29

ansaurus

tags:

views:

answers:

Converting a pdf to text/html in python so I can parse it

related questions