I am looking to convert any format to/from HTML.

I'd like to support DOC, DOCX, PDF, ODT, RDF, DocBook, and TXT.

I have found lots of format-to-format conversion utilities, but for convenience of implementation, a single tool is best. This will also make it easier to add new formats as the vendor or open-source project expands the library.

The ideal "hub" format is HTML, but I could also work with another hub format.

To run server-side, this preferably should be a Java library, or alternatively a C/C++ library, COM, or a command-line tool; but not a printer driver, online service, or GUI tool. Commercial and Open Source are okay.

+1  A: 

I don't believe such a utility/converter exists already, since certain conversions are rather hard to do reasonably. For example, how would YOU handle HTML-to-TXT-to-HTML conversion? What would you strip away? How would you represent different HTML elements in plain text? Furthermore, how would you handle content within content, like XML inside TXT transformed to DOCX and then to XHTML?

That said, if I were to build a converter for this purpose, I'd start with Apache POI, a library for handling Office documents. Then I'd use iText for PDF support and make sure [Office formats] <-> PDF conversion works as robustly as I need. After that I'd add JDOM for XML handling, test that [Office formats] <-> XML and PDF <-> XML work the way I want, and so on; you get the picture. I would specifically avoid implementing file type handlers myself, since I'd very likely be reinventing the wheel at that point.
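To make the "one handler per format, glued together" idea concrete, here is a minimal sketch of a hub-and-spoke registry, with HTML as the hub. Everything in it is hypothetical (the `HubConverter` class, its method names, and the use of plain strings instead of binary documents); in a real converter the TXT handlers would be joined by handlers that delegate to libraries such as POI or iText, which are not shown here. The toy TXT handlers also show the lossy round trip mentioned above: markup is simply stripped on the way to plain text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical registry: every format converts to and from one hub format (HTML). */
final class HubConverter {
    private final Map<String, Function<String, String>> toHub = new HashMap<>();
    private final Map<String, Function<String, String>> fromHub = new HashMap<>();

    /** Registers an importer (format -> HTML) and an exporter (HTML -> format). */
    void register(String format,
                  Function<String, String> importer,
                  Function<String, String> exporter) {
        String key = format.toLowerCase(Locale.ROOT);
        toHub.put(key, importer);
        fromHub.put(key, exporter);
    }

    /** Converts by going source -> hub -> target, so N formats need only 2N handlers. */
    String convert(String content, String from, String to) {
        Function<String, String> importer = toHub.get(from.toLowerCase(Locale.ROOT));
        Function<String, String> exporter = fromHub.get(to.toLowerCase(Locale.ROOT));
        if (importer == null || exporter == null) {
            throw new IllegalArgumentException("Unsupported conversion: " + from + " -> " + to);
        }
        return exporter.apply(importer.apply(content));
    }

    /** Builds a registry with only the two toy handlers defined below. */
    static HubConverter withTextAndHtml() {
        HubConverter hub = new HubConverter();
        hub.register("html", s -> s, s -> s);
        hub.register("txt", HubConverter::textToHtml, HubConverter::htmlToText);
        return hub;
    }

    /** Toy importer: escapes markup characters and wraps the text in one paragraph. */
    static String textToHtml(String text) {
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return "<p>" + escaped + "</p>";
    }

    /** Toy exporter: drops every tag (lossy) and unescapes the three basic entities. */
    static String htmlToText(String html) {
        String stripped = html.replaceAll("<[^>]*>", "");
        return stripped.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
    }
}
```

With this shape, adding a format means registering one importer and one exporter rather than writing a converter for every pair of formats. Converting `<p>Hello <b>world</b></p>` to TXT and back loses the bold markup, which is exactly the kind of decision each handler has to make.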

Esko
+10  A: 

OpenOffice.org

From this link:

One of the less well-known features of OpenOffice.org is its ability to run as a service. You can put that ability to some clever use. For example, you can turn OpenOffice.org into a conversion engine and use it to convert documents from one format to another via a Web-based interface or a command-line tool. JODConverter can help you to unleash OpenOffice.org's file conversion capabilities.

This sounds like what you're looking for. It's all in Java too.

This link tells you a little more about JODConverter mentioned above.
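As a rough illustration of driving the office suite from server-side Java, the sketch below shells out to a headless `soffice` process. Note that this is not JODConverter's API; it swaps in the command-line route instead. It assumes a LibreOffice install (the OpenOffice.org successor) with `soffice` on the PATH, since the `--convert-to` and `--outdir` options come from LibreOffice and older OpenOffice.org builds may not accept them. The `OfficeConvert` class and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that converts a document by invoking headless soffice. */
final class OfficeConvert {

    /** Builds the command line; assumes LibreOffice's soffice binary is on the PATH. */
    static List<String> command(String targetFormat, Path input, Path outDir) {
        return List.of(
                "soffice", "--headless",
                "--convert-to", targetFormat,   // target extension, e.g. "html" or "pdf"
                "--outdir", outDir.toString(),
                input.toString());
    }

    /** Runs the conversion and returns the process exit code. */
    static int run(String targetFormat, Path input, Path outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(targetFormat, input, outDir))
                .inheritIO()                    // show soffice output on our console
                .start();
        return process.waitFor();
    }
}
```

A call such as `OfficeConvert.run("html", Path.of("report.doc"), Path.of("out"))` would ask the suite to write the HTML version into the `out` directory. Starting a process per document is slow, which is one reason a long-running service, as described in the quote above, is attractive.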

jamesh
+1  A: 

This is a non-trivial problem. For example, I've spent the last month looking for a robust HTML+CSS to PDF converter in PHP and have only managed to get one (html2pdf) working reliably, albeit incredibly slowly. I've also discovered (from that question) Prince XML, which my initial testing has shown to be a superb product. It is, however, expensive.

cletus
+1  A: 

Have a look at Freemarker

I would suggest XML as the "hub" format, then separate out your styling information into an XSLT stylesheet.
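A minimal sketch of that idea, using only the JDK's built-in `javax.xml.transform` API: the content lives in a small XML hub document and the presentation lives in a separate stylesheet that renders it as HTML. The `doc`/`title`/`para` vocabulary, the stylesheet, and the `XmlHubToHtml` class are all made up for illustration; a real hub format (DocBook, for instance) would be much richer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical example: render a tiny XML hub document as an HTML fragment via XSLT. */
final class XmlHubToHtml {

    // Presentation is kept here, apart from the content. The element names are assumptions.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:template match='/doc'>"
          + "<div><h1><xsl:value-of select='title'/></h1>"
          + "<xsl:for-each select='para'><p><xsl:value-of select='.'/></p></xsl:for-each>"
          + "</div>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given hub XML and returns the resulting markup. */
    static String toHtml(String hubXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(hubXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String hub = "<doc><title>Hi</title><para>One</para><para>Two</para></doc>";
        System.out.println(toHtml(hub));
    }
}
```

Supporting another output format then means writing another stylesheet against the same hub XML, leaving the content untouched.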

opyate