Is there a way to use Acrobat Reader in Perl to save multiple PDF files as HTML files? | ansaurus

Q:

Is there a way to use Acrobat Reader in Perl to save multiple PDF files as HTML files?

Hello everybody,

I am using Xpdf to extract text from PDF files, which works well with the -raw option. Now we want to convert the PDF files to HTML so that we can extract formatting tags such as bold <b> and italics <i> along with the text. Xpdf with the -html option does work. I have also tried pdf2html for this, but did not find it reliable, as tags like <sup> and <sub> were missing.

We are now using Acrobat Reader to save the PDF files as HTML files which gives us all the HTML formatting tags.

Is there a way to use Acrobat Reader in Perl to save multiple PDF files as HTML files?

Thank you.

+2 A:

PDF styling information is completely arbitrary and can't be reliably mapped to HTML in any meaningful way. One strategy that I've had some luck with is to use the -xml option to pdftohtml and then use LibXML to apply some heuristics to the output and come up with a reasonable HTML approximation of the original document.
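A minimal sketch of that strategy, under stated assumptions: the sample XML is a hand-written stand-in for what `pdftohtml -xml` emits (real output carries more attributes and varies by version), the helper name `fragments_to_html` is hypothetical, and the superscript/subscript rule (a smaller font that vertically overlaps the preceding normal-size fragment) is just one possible heuristic, not a definitive one.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hand-written sample shaped like `pdftohtml -xml` output (an assumption;
# run the tool on your own PDFs and inspect the real file first).
my $xml = <<'XML';
<pdf2xml>
  <page number="1" height="1188" width="918">
    <fontspec id="0" size="16" family="Times" color="#000000"/>
    <fontspec id="1" size="10" family="Times" color="#000000"/>
    <text top="100" left="50" width="60" height="18" font="0">E = mc</text>
    <text top="96" left="110" width="6" height="11" font="1">2</text>
    <text top="140" left="50" width="20" height="18" font="0">H</text>
    <text top="150" left="70" width="6" height="11" font="1">2</text>
    <text top="140" left="76" width="20" height="18" font="0"><b>O</b></text>
  </page>
</pdf2xml>
XML

# Hypothetical helper: turn positioned text fragments into <p> lines,
# guessing <sup>/<sub> from font size and vertical offset.
sub fragments_to_html {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_string);

    # Map each font id to its point size.
    my %size = map { $_->getAttribute('id') => $_->getAttribute('size') }
               $doc->findnodes('//fontspec');

    my @lines;
    my $base;    # geometry of the last normal-size fragment on the current line
    for my $t ($doc->findnodes('//text')) {
        # Serialize children so inline <b>/<i> markup is preserved.
        my $inner  = join '', map { $_->toString } $t->childNodes;
        my $top    = $t->getAttribute('top');
        my $bottom = $top + $t->getAttribute('height');
        my $fs     = $size{ $t->getAttribute('font') } // 0;

        if ($base && $fs < $base->{size}
                  && $top < $base->{bottom} && $bottom > $base->{top}) {
            # Smaller text overlapping the base line: raised => sup, lowered => sub.
            my $tag = $top < $base->{top}       ? 'sup'
                    : $bottom > $base->{bottom} ? 'sub'
                    :                             undef;
            $lines[-1] .= $tag ? "<$tag>$inner</$tag>" : $inner;
        }
        elsif ($base && $top == $base->{top}) {
            # Same top coordinate as the base fragment: continue the line.
            $lines[-1] .= $inner;
            $base = { top => $top, bottom => $bottom, size => $fs };
        }
        else {
            # Anything else starts a new line.
            push @lines, $inner;
            $base = { top => $top, bottom => $bottom, size => $fs };
        }
    }
    return map { "<p>$_</p>" } @lines;
}

print "$_\n" for fragments_to_html($xml);
# <p>E = mc<sup>2</sup></p>
# <p>H<sub>2</sub><b>O</b></p>
```

To handle many files, one option is to loop over the PDFs, run `pdftohtml -xml` on each, and feed the resulting XML file to the helper. The thresholds will likely need tuning per document set, since the same layout can be encoded in many ways.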

friedo 2009-07-27 06:24:55
