What methods are there to transform a PDF to HTML? It could be anything: an online service, software, or a library. (Open source preferred. For a library, PHP or Python would be preferred.) It has to keep the original layout (including page numbers, footnotes and such), keep the images (combining them into one single background image per page is acceptable) and keep the links. It should preferably output valid XHTML and clean up PDF features such as ligatures, but if some post-processing is required, I can live with that. Something with clean, relatively semantic HTML output would be great.

The closest one I found was zamzar.org, but it choked on links. (Also, the HTML output is an ugly heap of absolutely positioned divs and needs post-processing because of encoding problems.)

A: 

A few years ago I used ABBYY PDF Transformer, and it worked well for simple documents.

dev-null-dweller
A: 

I worked with the iText library and found it good at parsing the PDF structure (I used it to search for text). It parses a PDF and creates an object model out of it, so you will need to write the HTML generator yourself, but that should not be too difficult.
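To give an idea of the "HTML generator" part, here is a rough sketch in Python (the asker's preferred language; iText itself is a Java/.NET library). The `Span` and `Page` classes are made up for illustration and are not iText's object model; they only assume that whatever parser you use can hand you text runs with a position, a font size and an optional link target. Ligature cleanup is done with Unicode NFKC normalization, which is one possible approach and may be too aggressive for some documents.

```python
import html
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One run of text on a page (hypothetical model, not iText's API)."""
    text: str
    x: float                    # left offset in points
    y: float                    # top offset in points
    size: float = 12            # font size in points
    href: Optional[str] = None  # link target, if the run is a link


@dataclass
class Page:
    """A page with its dimensions in points and the text runs on it."""
    width: float
    height: float
    spans: List[Span] = field(default_factory=list)


def span_to_html(span: Span) -> str:
    # NFKC folds compatibility ligatures such as U+FB01 into plain letters.
    # Caveat: it also folds other compatibility characters (for example
    # superscript digits), which may not be wanted for footnote markers.
    text = html.escape(unicodedata.normalize("NFKC", span.text))
    if span.href:
        text = f'<a href="{html.escape(span.href, quote=True)}">{text}</a>'
    style = (
        f"position:absolute;left:{span.x:g}pt;top:{span.y:g}pt;"
        f"font-size:{span.size:g}pt"
    )
    return f'<span style="{style}">{text}</span>'


def page_to_html(page: Page, number: int) -> str:
    # One relatively positioned container per page keeps the layout intact.
    style = f"position:relative;width:{page.width:g}pt;height:{page.height:g}pt"
    body = "\n".join(span_to_html(s) for s in page.spans)
    return f'<div class="page" id="page-{number}" style="{style}">\n{body}\n</div>'


def document_to_html(pages: List[Page]) -> str:
    return "\n".join(page_to_html(p, i) for i, p in enumerate(pages, start=1))


if __name__ == "__main__":
    demo = Page(595, 842, [
        Span("\ufb01rst line", 72, 72),
        Span("a link", 72, 90, href="http://example.com/?a=1&b=2"),
    ])
    print(document_to_html([demo]))
```

Note that this still produces absolutely positioned elements, so it preserves the layout and links but is not the semantic markup the question hopes for; grouping runs into paragraphs and headings would need extra heuristics on top.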

garph0