I would like to convert doc/docx documents to semantic HTML.
Some wishes/requirements:
• Semantic HTML, so that headings in the document become <h1>, <h2>, etc., tables become <table>, and so forth.
• Should preferably handle headings, lists, tables and images. Graphs and math formulas would be a nice extra.
• Doesn't have to convert straight from doc/docx to HTML; it could go through an intermediary format, such as XML or DocBook.
• Should work programmatically, and with a large number of documents.
The closest thing to a solution I've found so far is http://holloway.co.nz/docvert/index.html, but unfortunately it has quite a few bugs and a small user base, and it can't handle a large number of documents. It is more of a proof of concept.
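
To make the "semantic" requirement concrete, here is a rough sketch of the kind of mapping I mean. It is not a solution: it only looks at the main XML part of a docx file (word/document.xml), assumes the default English style IDs Heading1 to Heading6, and treats everything else as a plain paragraph, so lists, tables and images are not handled. The function name paragraphs_to_html is just a hypothetical name for illustration.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_to_html(document_xml: str) -> str:
    """Map Word paragraphs to <h1>..<h6> or <p> based on their style ID.

    Assumption: headings use the default style IDs "Heading1".."Heading6".
    Documents with custom or localized style IDs would need another mapping.
    """
    root = ET.fromstring(document_xml)
    out = []
    for p in root.iter(W + "p"):
        # Concatenate all text runs in the paragraph
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if not text.strip():
            continue  # skip empty paragraphs
        style = p.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        match = re.fullmatch(r"Heading([1-6])", style_id)
        tag = f"h{match.group(1)}" if match else "p"
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(out)
```

For a real file, the XML string could be read with the standard zipfile module, e.g. `zipfile.ZipFile(path).read("word/document.xml").decode("utf-8")`, and the same function could then be run in a loop over a directory of documents.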