How can I convert PDF to HTML?

A:

In Perl, you can use the SWISH::Filter plugin SWISH::Filters::Pdf2HTML. (It requires the xpdf package.)

For the reverse (HTML to PDF), see this question.

Ether 2009-10-28 18:07:59

A:

If you're looking for a way to convert PDF to HTML once or twice, then I recommend Adobe Online Conversion.

If it's an API you're after then http://www.pdfonline.com/ has an SDK that should suit your needs.

If it's a library you're after then please let us know which server-side language you prefer.

Russ Bradberry 2009-10-28 18:22:57

Thanks Russ! I'm using Adobe Online so far. I tried the website and the results are difficult to gauge. But thanks for the help!

2009-10-28 18:47:31

A:

If you are working on a Windows box, I think Amyuni has a library for this as well. Their PDF Document Converter is accessible as a DLL, can be used from most of the languages supported by Visual Studio, and can convert to RTF, HTML, Excel, JPEG, and TIFF.

William Daniel 2009-10-29 19:01:15

A:

Given the vagueness of the original question, I'm going to give a solution that will work with any language that can execute command-line apps. Although it can be a little tricky to set up, OpenOffice can be run in headless mode on a server and, with the help of jodconverter, can convert between any file formats that OpenOffice itself can handle.
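As a rough sketch of what that looks like from a script, the Python below only builds the two command lines involved: one to start OpenOffice as a headless listener and one to call the jodconverter command-line jar. The single-dash flags and port 8100 follow the older OpenOffice/JODConverter 2.x conventions (newer LibreOffice builds use double-dash options), and the jar path and file names are placeholders, not real paths. Whether PDF is accepted as an input depends on the import filters installed in your OpenOffice.

```python
def soffice_listener_command(port=8100):
    """Arguments to start OpenOffice headless, listening on a local socket.

    Assumption: old-style single-dash options; adjust for your install.
    """
    accept = f"socket,host=127.0.0.1,port={port};urp;"
    return ["soffice", "-headless", f"-accept={accept}", "-nofirststartwizard"]


def jodconverter_command(jar_path, src, dst):
    """Arguments to convert src to dst; the target format comes from dst's extension.

    jar_path is a hypothetical location of the jodconverter CLI jar.
    """
    return ["java", "-jar", str(jar_path), str(src), str(dst)]


if __name__ == "__main__":
    # Print the commands rather than running them, so nothing needs to be installed.
    print(" ".join(soffice_listener_command()))
    print(" ".join(jodconverter_command("jodconverter-cli.jar", "input.pdf", "output.html")))
```

Each list can be passed to `subprocess.Popen` (for the listener) or `subprocess.run` (for the conversion) once both tools are installed.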

Here are a couple of links that help with the setup:

Karim 2009-10-30 02:04:02

+2 A:

http://www.lowagie.com/iText/ Open-source library for both Java and C#.

Aizaz 2009-10-30 04:26:22

This is probably your best bet. Parse the PDF using the library and generate HTML from the data.
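To illustrate the "generate HTML from the data" half of that idea, here is a minimal Python sketch. It assumes the parsing step has already produced positioned text fragments as `(x, y, text)` tuples measured in points from the top-left of the page. That tuple layout is made up for this example and is not the output format of iText or any other library; the extraction itself is library-specific and not shown.

```python
from html import escape


def fragments_to_html(fragments):
    """Render (x, y, text) fragments as absolutely positioned spans.

    The fragment format is hypothetical; adapt it to whatever your parser yields.
    """
    parts = ['<div style="position:relative">']
    for x, y, text in fragments:
        # Escape the text so characters like < and & survive in the HTML output.
        parts.append(
            f'<span style="position:absolute;left:{x}pt;top:{y}pt">{escape(text)}</span>'
        )
    parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(fragments_to_html([(72, 72, "Title"), (72, 100, "Body text with <angle> brackets")]))
```

Absolute positioning keeps the page layout but produces markup that does not reflow; grouping fragments into paragraphs would be the next step if you need flowing HTML.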

TJB 2009-10-30 05:44:30

A:

Our service takes the URL of the webpage that you want to convert and returns the PDF. It's not a library, but it takes only 5 minutes to get working, so it might be of value? The URL is http://fourpdf.com/

Regards, Jake.

Jake Liddell 2009-11-02 12:47:06

Wrong direction.

reinierpost 2009-11-02 12:54:35

Without more information about the domain of what he's trying to achieve, I don't see why this is "Wrong direction". It's a simple solution that can be used programmatically.

Jake Liddell 2009-11-02 13:13:31

It's the wrong direction because the asker wanted PDF->HTML and your answer is HTML->PDF.

Sam Holder 2010-05-08 17:21:01

+1 A:

PDFBox at Apache has an HTML extraction capability. http://pdfbox.apache.org/

John Thorhauer 2009-11-23 17:47:52

A:

The pdftohtml program converts PDF to HTML and XML and preserves the position information of the text, which is helpful for scraping tables.

It seems to be based on the xpdf library and has a Windows binary, too.
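If you want to drive pdftohtml from a script, a small wrapper along these lines should be enough. This is a sketch, assuming the `pdftohtml` binary is on your PATH and accepts the `-xml` and `-noframes` options (as the poppler-utils build does); the function names and file names are just for illustration.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, out_base, as_xml=False):
    """Build the argument list for a pdftohtml call without running it."""
    cmd = ["pdftohtml"]
    # -xml: XML output with position attributes for each text element.
    # -noframes: a single HTML document instead of a frameset.
    cmd.append("-xml" if as_xml else "-noframes")
    cmd += [str(pdf_path), str(out_base)]
    return cmd


def convert(pdf_path, out_base, as_xml=False):
    """Run pdftohtml; raises if the binary is missing or exits with an error."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml not found on PATH")
    subprocess.run(build_pdftohtml_command(pdf_path, out_base, as_xml), check=True)


if __name__ == "__main__":
    # Show the command that would be executed for the XML (table-scraping) case.
    print(" ".join(build_pdftohtml_command("report.pdf", "report", as_xml=True)))
```

For table scraping, the XML output is usually the more convenient one, since the position attributes can be read with any XML parser.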

Karsten W. 2010-10-04 07:56:43
