Is there software for Mac OS X 10.6.4 that converts PDF to HTML?

+3  A: 

Did you try Adobe's online service at http://www.adobe.com/products/acrobat/access_onlinetools.html ?

Or try PDFtoHTML - 0.1b

Adnan
+1  A: 

You may find the results disappointing. I actually wrote an article on the problems with PDF to HTML conversion on our blog (http://www.jpedal.org/PDFblog/?p=402)

mark stephens
A: 

pdftohtml looks like it does what you want.

If you have MacPorts installed, run the following command in Terminal to install it:

sudo port install pdftohtml
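
If you want to drive the tool from a script afterwards, here is a minimal Python sketch. It assumes the installed binary is called pdftohtml, is on your PATH, and accepts the input PDF followed by an output base name; the function names and file names are made up for illustration, and no extra options are passed because I have not checked which ones your build supports.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, output_base):
    """Return the argument list for a plain pdftohtml run (no options)."""
    return ["pdftohtml", str(pdf_path), str(output_base)]


def convert(pdf_path, output_base):
    """Run pdftohtml if it can be found on PATH; fail loudly otherwise."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml not found on PATH; install it first")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_pdftohtml_command(pdf_path, output_base), check=True)


# Example (hypothetical file names):
# convert("input.pdf", "output")
```
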
Johnsyweb
+1  A: 

It depends on what your expectations are. The tools mentioned above will both do the job as well as is possible, but as Mark Stephens suggests, the results are often disappointing.

A major reason is that the two formats have contradictory goals.

PDF is about preserving layout, at the cost of content and structure.

HTML and CSS are the complete opposite - the actual physical rendering can change significantly as the page is resized, but the content and the relationships between the elements are preserved, even at the cost of aesthetics.

In a typical PDF document, rather than paragraphs of text, we have objects placed at X/Y co-ordinates.

These objects may be strings, but the PDF viewer has no concept of how lines flow together to form paragraphs, etc, just that it must draw these characters starting at this co-ordinate.
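
To make that concrete, here is a small Python sketch of what a converter has to start from. The fragments, their coordinates and the tolerance value are all invented for illustration (and y grows downward here, which is a simplification); real extractors are far more involved.

```python
# Hypothetical positioned strings as (x, y, text), in arbitrary drawing order.
fragments = [
    (150, 100, "preserving layout,"),
    (72, 114, "not structure."),
    (72, 100, "PDF is about"),
]


def group_into_lines(fragments, tolerance=2):
    """Guess text lines by clustering fragments with nearly equal y values."""
    lines = []  # each entry is (y, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within a line, order the pieces left to right and join with spaces.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


print(group_into_lines(fragments))
```

Even this only recovers lines. Nothing in the input says which lines belong to the same paragraph, heading or caption.
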

Other ways of looking at it:

On OS X, the Quartz graphics layer is also known as 'Display PDF'. It is the layer below Safari: it is what HTML and CSS are converted into when the browser turns the current layout into something to be shown on the screen.

In typesetting terms, the PDF is the page of laid out type, ready to go to the printer, not the manuscript.

So any PDF-to-HTML or PDF-to-text converter has to try to 'read' the text, and infer the hidden layout structure purely from what it can 'see'. It is like a human being trying to create an HTML and CSS layout from a printed copy of a magazine.

That is why selecting text from a multi-column PDF document is difficult to do, particularly if there are things like break-out quote boxes, embedded adverts, etc, in the page. It starts to become an AI problem.
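
A toy Python illustration of the multi-column problem, using an invented two-column page (coordinates and the gutter position are assumptions, with y growing downward):

```python
# Hypothetical two-column page as (x, y, text).
page = [
    (72, 100, "Left column, line 1"),
    (320, 100, "Right column, line 1"),
    (72, 114, "Left column, line 2"),
    (320, 114, "Right column, line 2"),
]


def naive_reading_order(fragments):
    """Sort top to bottom, then left to right, ignoring columns entirely."""
    return [text for _, _, text in sorted(fragments, key=lambda f: (f[1], f[0]))]


def column_aware_order(fragments, gutter_x):
    """Read everything left of gutter_x first, then everything right of it."""
    left = [f for f in fragments if f[0] < gutter_x]
    right = [f for f in fragments if f[0] >= gutter_x]
    return naive_reading_order(left) + naive_reading_order(right)


print(naive_reading_order(page))      # the two columns come out interleaved
print(column_aware_order(page, 300))  # left column first, then right column
```

Note that the gutter position is handed in by the caller here. Working it out automatically, on pages with pull quotes and adverts cutting across the columns, is exactly the hard part.
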

JulesLt
A: 

PDFs are tricky stuff. Ever tried opening one up in vi? Yeah, it's all junk to me too. PDF's page description is derived from PostScript, so a renderer has to interpret it to generate the layout. Fonts, images and other resources are also bundled into a PDF document. Pretty much the antithesis of what HTML is.

If you would like to break free of the command line and use Automator, which is already installed on 10.4 or later, go right ahead: you can use the "Extract PDF Text" action to export the PDF's text to an RTF file, which can easily be converted to HTML. If you would like to automate the RTF-to-HTML conversion, I would recommend JOD Converter in conjunction with OpenOffice.org.

If your PDFs are a bit more complex, your solution will be too. I'd hate to mention a commercial solution on StackOverflow, but ABBYY FineReader will be what works best for you. With the Corporate or Enterprise edition you can even automate PDF conversion, but the home edition will do just fine to convert any PDF to HTML.

Mr. Dave
A brief addendum: ABBYY FineReader has a Mac-specific edition, but to the best of my knowledge, it doesn't allow for automated PDF conversion :(
Mr. Dave