views:

2801

answers:

2

How can I convert PDF files to HTML with Python?

I was thinking of something along the lines of what Google does (or seems to do) to index PDF files.

My final goal is to set up Apache to show the HTML for the PDF files, so anything leading me in that direction would also be appreciated.

+2  A: 

The poppler package provides a pdftohtml utility that you might be able to use. There is also a Python binding to libpoppler.
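
Calling the command-line utility from Python could look roughly like the sketch below. It assumes the `pdftohtml` binary from poppler's utilities is installed and on the PATH, and that it writes `<output_base>.html` when run with `-noframes`. The function names `build_pdftohtml_command` and `convert_pdf_to_html` are hypothetical helpers made up for this illustration, not part of poppler or of any Python binding.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, output_base):
    """Build the argument list for a single-file pdftohtml run (hypothetical helper)."""
    return [
        "pdftohtml",
        "-noframes",  # write one HTML file instead of a frameset
        "-q",         # do not print messages or errors
        str(pdf_path),
        str(output_base),
    ]


def convert_pdf_to_html(pdf_path, output_dir):
    """Convert one PDF and return the path where the HTML is expected.

    Assumption: pdftohtml appends ".html" to the output base name.
    """
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml was not found on PATH")

    pdf_path = Path(pdf_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Use the PDF's file name (without extension) as the output base.
    output_base = output_dir / pdf_path.stem
    subprocess.run(build_pdftohtml_command(pdf_path, output_base), check=True)
    return output_base.parent / (output_base.name + ".html")
```

Something like `convert_pdf_to_html("report.pdf", "/var/www/html/pdf")` would then leave an HTML file in a directory that Apache can serve as static content; both paths here are only placeholders.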

Martin v. Löwis
The Python binding is mostly for rendering PDFs in a GTK widget/UI, so I am not sure it would help here.
Ali A
I haven't actually used it, but it does expose poppler_page_get_text, which might be useful to the OP.
Martin v. Löwis
Right, but it seems like a big waste to pull in the GTK/GLib bindings if that's all the OP wants, especially as there are easier ways that don't depend on a UI toolkit (e.g. the pdftohtml utility you mention). I should say I generally like the bindings, and I was the original author. Maybe not in this case, though.
Ali A
A: 

This question asks something similar enough, and got some useful answers:

http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text

Marcos Lara