Hello everybody,

I am using Xpdf to extract text from PDF files, which works well with the -raw option. Now we want to convert the PDF files to HTML so that we can extract HTML formatting tags such as bold <b> and italics <i> along with the text. Xpdf with the -html option does work; I have also tried pdf2html for this, but did not find it reliable, as tags like <sup> and <sub> were missing.

We are now using Acrobat Reader to save the PDF files as HTML files which gives us all the HTML formatting tags.

Is there a way to use Acrobat Reader in Perl to save multiple PDF files as HTML files?

Thank you.

+2  A: 

PDF styling information is completely arbitrary and can't be reliably mapped to HTML in any meaningful way. One strategy that I've had some luck with is to use the -xml option to pdftohtml and then use LibXML to apply some heuristics to the output and come up with a reasonable HTML approximation of the original document.
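As a rough illustration of that strategy (not the answerer's actual heuristics), here is a minimal Perl sketch using XML::LibXML. It assumes the XML has `<fontspec id size>` and `<text top height font>` elements, as pdftohtml's -xml output does. The helper name `xml_to_html` and the geometric rule are made up for this example: a run in a smaller font that vertically overlaps the preceding run is treated as `<sup>` if it sits higher and `<sub>` if it sits lower.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the answer): convert the XML written by
# `pdftohtml -xml` into rough HTML, guessing <sup>/<sub> from geometry.
sub xml_to_html {
    my ($xml) = @_;

    # load_ext_dtd => 0 avoids fetching the DTD named in the DOCTYPE.
    my $doc = XML::LibXML->load_xml(string => $xml, load_ext_dtd => 0);

    # Collect font sizes document-wide, keyed by fontspec id.
    my %size_of = map { ($_->getAttribute('id') => $_->getAttribute('size')) }
                  $doc->findnodes('//fontspec');

    my @pages;
    for my $page ($doc->findnodes('//page')) {
        my ($html, $base) = ('');    # $base: last normal-sized text run
        for my $t ($page->findnodes('./text')) {
            my $size = $size_of{ $t->getAttribute('font') // '' } // 0;
            my $top  = $t->getAttribute('top');
            my $h    = $t->getAttribute('height');

            # Keep any inline markup (e.g. <b>, <i>) already inside <text>.
            my $inner = join '', map { $_->toString } $t->childNodes;

            # Heuristic: smaller font that vertically overlaps the base run.
            if ($base && $size < $base->{size}
                && $top < $base->{top} + $base->{height}
                && $top + $h > $base->{top}) {
                my $raised = ($top + $h / 2) < ($base->{top} + $base->{height} / 2);
                my $tag = $raised ? 'sup' : 'sub';
                $html .= "<$tag>$inner</$tag>";
                next;
            }

            # A change in vertical position starts a new line.
            $html .= "<br/>\n" if $base && $top != $base->{top};
            $html .= $inner;
            $base = { top => $top, height => $h, size => $size };
        }
        push @pages, $html;
    }
    return join "\n<hr/>\n", @pages;
}

# Usage (assumed workflow): run `pdftohtml -xml input.pdf output`,
# read the resulting output.xml into a string, and pass it to xml_to_html().
```

Real documents will need more rules than this (footnote markers, drop caps, and small-caps fonts all confuse a size-only test), so treat the thresholds as a starting point to tune against your own PDFs.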

friedo