Hi,

I need to convert a Word document into one or more HTML files in Java. The function takes a Word document as input and produces one HTML file per page: if the Word document has 3 pages, 3 HTML files are generated, split at the page breaks.

I searched for open-source or non-commercial APIs that can convert .doc to HTML, but found nothing. If anybody has done this kind of job before, please help.

Thanks

A: 

You'd have to find the MS Word .doc specification (the format is basically a binary dump of whatever is in Word at that point in time) and slowly go through it element by element, converting MS Word "objects/states" to their HTML equivalents. You might be able to find a script that does it for you, since this really isn't fun work and I'd advise against doing it yourself (converting file formats, or even reading commercial file formats on your own, is always hard and the result is often incomplete). PS: just google doc2html.

DavidG
Have you ever looked at the specification? (Scratch that, have you ever investigated the inconsistencies between the .rtf file containing the spec and the format it specifies?) This is infeasible: far too much work when there are other solutions available.
Arafangion
I did say it was hard and that specifications were often incomplete, and I advised against it.
DavidG
+2  A: 

Here are some starting points for you. Good luck.

On Microsoft's website, you can find documentation for the .doc format, and on the ECMA website, the .docx format. Microsoft has a category for Java on their OpenXML developer blog, including a post specifically about converting OpenXML to XHTML in Java.

lewinski
A: 

UNO interface into OpenOffice?

Brian Knoblauch
A: 

http://www.theserverside.com/news/thread.tss?thread_id=41942#216880 -- this has worked quite well for me earlier

anjanb
A: 

If you are targeting Word 2007 files, which use the OOXML format, then this article might help. There is also the Ooxml4j project, which is implementing an OOXML library for Java.

If you are targeting the binary files, though... that's another problem.

Vincent Ramdhanie
+2  A: 

I've used the following approach successfully in production systems where the new MS Word XML format isn't available:

Spawn a process that does something similar to:

http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html

You'd probably want to start OpenOffice once when your program starts, and then call the Python script as many times as you need to while it runs (with some sort of check to ensure the ooffice process is still there).
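One way to do the "is ooffice still there" check is to probe the TCP port the headless instance listens on before each conversion. This is only a minimal sketch: the class and method names are made up for illustration, and the host and port are assumptions that depend on how you started OpenOffice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of any OpenOffice API.
class OfficeProbe {
    /**
     * Returns true if something accepts TCP connections on host:port
     * within timeoutMillis. The host and port are assumptions; use
     * whatever your ooffice instance was started with.
     */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, host unreachable, or timed out.
            return false;
        }
    }
}
```

If the probe returns false you could restart ooffice before calling the script. Note that an open port only tells you that some process is listening, not that it is a healthy OpenOffice.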

The other option is to spawn the following sort of command every time you need to do the conversion:

ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"

I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
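For the spawn-a-command option, something along these lines could work from Java. It is a sketch, not the code from those systems: the class name, the timeout handling, and the macro URL in the usage note are assumptions, and you would substitute your own binary name and macro.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the "ooffice -headless macro://..." command.
class MacroConversion {
    /** Builds the argument list; no shell is involved, so no quoting is needed. */
    static List<String> buildCommand(String officeBinary, String macroUrl) {
        List<String> cmd = new ArrayList<>();
        cmd.add(officeBinary);
        cmd.add("-headless");
        cmd.add(macroUrl);
        return cmd;
    }

    /** Runs the command; returns true only if it exits with status 0 in time. */
    static boolean run(String officeBinary, String macroUrl, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(officeBinary, macroUrl))
                .inheritIO() // forward ooffice's output to this program's console
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // conversion hung, kill it
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

A call might look like MacroConversion.run("ooffice", "macro:///Standard.Convert.ToHtml(/tmp/in.doc)", 60), where Standard.Convert.ToHtml is a made-up macro name standing in for whatever macro you write.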

While there are mechanisms for doing it via MS Word, they're not easy from Java, and do require other support programs to drive MS Word via OLE.

I've used abiword before too, which works well for many documents, but does get confused with more complex documents (ooffice seems to handle everything I've thrown at it). Abiword has a slightly easier command line interface for conversion than ooffice.

Jamie Love
A: 

Here's something used by someone who's been doing this for a while: http://www.jroller.com/rickard/entry/word_to_html_in_java

anjanb
+2  A: 

We use tm-extractors (http://mvnrepository.com/artifact/org.textmining/tm-extractors), and fall back to the commercial Aspose (http://www.aspose.com/). Both have native Java APIs.

Chase Seibert
A: 

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import officetools.OfficeFile; // package available at www.dancrintea.ro/doc-to-pdf/
...
FileInputStream fis = new FileInputStream(new File("test.doc"));
FileOutputStream fos = new FileOutputStream(new File("test.html"));
OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
f.convert(fos,"html");

All possible conversions:

doc --> pdf, html, txt, rtf

xls --> pdf, html, csv

ppt --> pdf, swf

html --> pdf

A: 

www.dancrintea.ro/doc-to-pdf/

A: 

If it's a docx, you could use docx4j (ASL v2), which uses XSLT to create the HTML.

However, it will give you a single HTML for the whole document.

If you wanted an HTML per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).
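To illustrate the lastRenderedPageBreak idea, here is a rough sketch using only the JDK's DOM parser rather than docx4j. It assumes you have already pulled word/document.xml out of the docx as a string, and it only collects plain text per page; the class and method names are made up, and real HTML generation would still be needed for each chunk.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: splits the text of a WordprocessingML document into pages.
class PageSplitter {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of each page, split wherever w:lastRenderedPageBreak occurs. */
    static List<String> splitText(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on the w: namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

        List<StringBuilder> pages = new ArrayList<>();
        pages.add(new StringBuilder());
        walk(doc.getDocumentElement(), pages);

        List<String> result = new ArrayList<>();
        for (StringBuilder page : pages) {
            result.add(page.toString());
        }
        return result;
    }

    // Depth-first walk in document order: w:t appends text to the current
    // page, w:lastRenderedPageBreak starts a new page.
    private static void walk(Node node, List<StringBuilder> pages) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(node.getNamespaceURI())) {
            String name = node.getLocalName();
            if ("lastRenderedPageBreak".equals(name)) {
                pages.add(new StringBuilder());
                return;
            }
            if ("t".equals(name)) {
                pages.get(pages.size() - 1).append(node.getTextContent());
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, pages);
        }
    }
}
```

Keep in mind the markers only reflect where Word last laid out the pages, so a document that was edited by another tool or never opened in Word may have none.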

plutext
A: 

This is easier with the newer MS Word .docx format, because the document content is stored as XML (inside a zip package). You can use an XSL stylesheet to transform the document's XML into HTML.

If, however, your Word document is in the older binary format, you can use the Apache POI library (http://poi.apache.org/) to read it into Java objects, and from that point on you can convert it to HTML using an HTML library for Java:
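As a small illustration of the XSL route, the sketch below uses the JDK's built-in javax.xml.transform with a toy stylesheet that turns each Word paragraph into an HTML p element. The stylesheet and class name are made up for this example, it assumes word/document.xml has already been extracted as a string, and it ignores formatting, tables, images and everything else a real conversion needs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class; the stylesheet is a toy, not a full converter.
class WordXmlToHtml {
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
        + " exclude-result-prefixes='w'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes' indent='no'/>"
        // Root: wrap all paragraphs in html/body.
        + "<xsl:template match='/'>"
        + "<html><body><xsl:apply-templates select='//w:p'/></body></html>"
        + "</xsl:template>"
        // Each Word paragraph becomes a p holding the text of its w:t runs.
        + "<xsl:template match='w:p'>"
        + "<p><xsl:for-each select='.//w:t'><xsl:value-of select='.'/></xsl:for-each></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the toy stylesheet to the given WordprocessingML and returns the markup. */
    static String transform(String documentXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(documentXml)),
                new StreamResult(out));
        return out.toString();
    }
}
```

In practice you would keep the stylesheet in its own .xsl file and grow it template by template for runs, bold/italic properties, tables and so on.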

http://www.dom4j.org/dom4j-1.4/apidocs/org/dom4j/io/HTMLWriter.html

A: 

I see this thread turns up in external links and gets the occasional post, so I thought I'd post an update (hope no one minds). OpenOffice continues to evolve, and release 3.2 improves the Word import/export filters again. OpenOffice and Java both run on many platforms, so Java systems can use the OpenOffice UNO API directly to import, manipulate and export documents in many formats (including Word and PDF), or use a library like JODReports or Docmosis to make that easier. Both have free/open options.

jowierun