  • conversion from multiple non-graphical document formats to and from HTML (e.g. doc<->HTML, pdf<->html, odt<->html, etc.)
  • command line or API (Java API is preferable)
  • cross-platform
  • commercial or open source

Are there any well known solutions that meet/exceed these requirements?

A: 

OpenOffice has a rich API that supports conversion between its supported formats. Check out this question, which recommends using JODConverter.
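If pulling in a library is not an option, the same office engine can also be driven from the command line. The sketch below is not JODConverter and not the OpenOffice API itself; it is a minimal Java wrapper that assumes a LibreOffice-style `soffice` binary on the PATH that accepts the `--headless --convert-to <format> --outdir <dir>` options (an Apache OpenOffice install may not support them). The class and method names are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: wraps a headless office binary for batch conversion.
final class HeadlessOfficeConverter {

    // Builds the argument list without running anything, so it can be
    // inspected or logged before a process is launched.
    static List<String> buildCommand(String officeBinary, Path input,
                                     String targetFormat, Path outputDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add(officeBinary);          // assumed: "soffice" from LibreOffice
        cmd.add("--headless");          // run without a GUI
        cmd.add("--convert-to");
        cmd.add(targetFormat);          // e.g. "html", "pdf", "odt"
        cmd.add("--outdir");
        cmd.add(outputDir.toString());
        cmd.add(input.toString());
        return cmd;
    }

    // Runs the conversion and returns the process exit code (0 on success).
    static int convert(String officeBinary, Path input,
                       String targetFormat, Path outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                buildCommand(officeBinary, input, targetFormat, outputDir))
                .inheritIO()            // show the converter's own messages
                .start();
        return process.waitFor();
    }
}
```

A call such as `HeadlessOfficeConverter.convert("soffice", Paths.get("report.odt"), "html", Paths.get("out"))` would then cover the odt-to-HTML case. Keeping the command construction separate from the process launch makes it easy to check the arguments on a machine where no office suite is installed.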

codelogic
A: 

With DocBook you can export to various output formats, but converting back is always hard. For PDF you can try iText.

A: 

Having written an all-in-one TeX/LaTeX to HTML, ASCII text, and RTF converter, I would say this would be quite an undertaking.

The problem is that these various 'document' formats are intended for rather different purposes. While conversion tools do exist between some of these formats, there is often a conceptual disparity in the structure, meaning, and implementation of a 'document', and it is very often necessary to trade off features supported by one format to hack together acceptable output in another. For example, PDF is very strong in presentation, precise positioning, and font support, whereas HTML is more concerned with structure and gives practically no consideration to these things (without CSS).

I am curious how you envision such an API being used, when usually someone simply wants a conversion program.

Roger Nelson