I want to convert MS Word documents to images and PDF. I am well versed in Java and am looking for APIs that can do this.

Can anyone please point me to one?

If you know of good alternatives, even for other languages, please let me know.

+5  A: 

JODConverter

Snippet from home page:

JODConverter automates all conversions supported by OpenOffice.org, including

* Any format to PDF
      o OpenDocument (Text, Spreadsheet, Presentation) to PDF
      o Word to PDF; Excel to PDF; PowerPoint to PDF
      o RTF to PDF; WordPerfect to PDF; ...
* And more
      o OpenDocument Presentation (odp) to Flash; PowerPoint to Flash
      o RTF to OpenDocument; WordPerfect to OpenDocument
      o Any format to HTML (with limitations)
      o Support for OpenOffice.org 1.0 and old StarOffice formats
      o ...
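
For a quick experiment without any library, the same office engine can also be driven from Java by shelling out to it. The sketch below is not the JODConverter API. It assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless`, `--convert-to` and `--outdir`; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class WordToPdfCommand {

    // Builds the argument list for a headless conversion to PDF.
    // Assumption: "soffice" is on the PATH; use a full path otherwise.
    static List<String> buildCommand(Path input, Path outputDir) {
        return Arrays.asList(
                "soffice",
                "--headless",
                "--convert-to", "pdf",
                "--outdir", outputDir.toString(),
                input.toString());
    }

    // Runs the conversion and returns the exit code of the office process.
    static int convert(Path input, Path outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outputDir))
                .inheritIO() // show the office suite's own messages
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: WordToPdfCommand <input.doc> <output-dir>");
            return;
        }
        int exitCode = convert(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("soffice exited with code " + exitCode);
    }
}
```

Starting one office process per document is slow, so for anything beyond occasional conversions a long-running office instance, which is what JODConverter connects to, is likely the better fit.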
Stu Thompson
Keep in mind that OOo's support for MS file formats isn't particularly reliable.
xyz
I'll agree with that... It has worked fine for my own use case, though. Definitely a 'YMMV' kind of thing.
Stu Thompson
I was aware of Apache POI. JODConverter seems to be great. Has anyone tried it on a large scale?
Vaibhav Kamble
See this: http://stackoverflow.com/questions/355447/openoffice-command-line-pdf-creation
OscarRyz
A: 

I'd say iText.

See its image support.

It's free, and it has very broad coverage (for example, 732k Google hits), a very large community and tons of free tutorials.

For just creating PDFs it is absolutely the thing to check out. For reading Word docs you can also check out the Apache POI API (http://poi.apache.org/), a Java API that is built to access MS formats.

If I had to deal with your problem, I think I would end up combining these two options.

Peter
Just to warn you, combining those technologies is asking for pain. I've tried it, and while it could be done, it's much more difficult than you might expect to make it work well.
Ian McLaird
Then I have to conclude that POI is the problem, because iText isn't, and you could perfectly well combine them, with a layer between them of course. The combination can only be a problem if one of the two is a problem.
Peter
A: 

Apache POI

Damo
A: 

Just to throw out another option, I'll tell you the one that's given me the "best" result as far as accurately reproducing the appearance of the Word document. I used Jacob to call MS Word (via COM) to do the conversion. This will give you a perfect recreation of the Word document as a PDF.

However, there are a few downsides.

First, it's not pure Java. Jacob is a Java wrapper around a C++ native library. This has caused a few class loader issues in a servlet environment (specifically, we have to totally restart the web container in order to restart the application).

Second, you need Word (and the Word "Save as PDF" extension). This means that it's not portable to platforms besides Windows.

Finally, exception handling is spotty. I've seen it leave a file open (and locked) after it seems like it should be all done. This doesn't happen very often, but it is something I ran into.
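
One way to reduce the risk of a locked file is to put the close and quit calls in `finally` blocks so they run even when the conversion throws. The sketch below only illustrates that discipline. `WordSession` is a hypothetical stand-in for whatever wrapper you build around the COM calls, not a Jacob type, and `RecordingSession` is a fake used to show the call order.

```java
import java.util.ArrayList;
import java.util.List;

class WordCleanupSketch {

    // Hypothetical wrapper around the COM calls; not part of Jacob.
    interface WordSession {
        void open(String docPath);
        void saveAsPdf(String pdfPath);
        void closeDocument();
        void quit();
    }

    // Always closes the document and quits Word, even if a step fails.
    static void convert(WordSession word, String docPath, String pdfPath) {
        try {
            word.open(docPath);
            try {
                word.saveAsPdf(pdfPath);
            } finally {
                word.closeDocument(); // releases the lock on the file
            }
        } finally {
            word.quit(); // do not leave a Word process behind
        }
    }

    // Fake session that records calls and can fail while saving.
    static class RecordingSession implements WordSession {
        final List<String> calls = new ArrayList<>();
        final boolean failOnSave;

        RecordingSession(boolean failOnSave) {
            this.failOnSave = failOnSave;
        }

        public void open(String docPath) { calls.add("open"); }

        public void saveAsPdf(String pdfPath) {
            calls.add("saveAsPdf");
            if (failOnSave) {
                throw new IllegalStateException("simulated COM failure");
            }
        }

        public void closeDocument() { calls.add("closeDocument"); }

        public void quit() { calls.add("quit"); }
    }

    public static void main(String[] args) {
        RecordingSession session = new RecordingSession(true);
        try {
            convert(session, "in.doc", "out.pdf");
        } catch (IllegalStateException e) {
            System.out.println("conversion failed: " + e.getMessage());
        }
        System.out.println("calls made: " + session.calls);
    }
}
```

This does not help if Word itself hangs, so a timeout around the whole conversion may still be needed.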

All in all, though, it does give you an option if your users want an absolutely perfect recreation. I tried OpenOffice, as well, and the API was better, as Stu Thompson's post shows, but the documents were different enough that my users were unsatisfied, and this was what I ended up doing.

Ian McLaird
A: 

You can do this with docx4j (Apache licence)

It supports 3 ways to generate a PDF from a docx:

  1. via HTML, using the docx-to-HTML XSLT from http://www.codeplex.com/OpenXMLViewer

  2. via iText

  3. via XSL FO (using FOP)

Approaches 2 & 3 are currently only in the SVN code, and will be the recommended way to do it (though at the moment, approach 1 has better table support).

Here is example code: http://dev.plutext.org/trac/docx4j/browser/trunk/docx4j/src/main/java/org/docx4j/samples/CreatePdf.java

plutext
A: 

If you want to combine many images into a single PDF, or convert Word files to PDF, you can try Nemo Image to PDF. The program converts all kinds of images into PDF format, and you can either create an individual PDF file for each image or a single PDF file containing all of the images. You can also set properties, page layout and other options for the output PDF files. More information is available on the Nemo PDF official site.

A: 

Although some of the answers to this question provide a more 'native' Java based approach, the quality of the converted documents will be lacking. Depending on your needs this may not be a problem though.

If conversion fidelity is important to you, then have a look at the Muhimbi PDF Converter Web Services. It runs on Windows as a service, but can be accessed from any web-services-capable environment, including non-Windows ones, such as Java and .NET.

Disclaimer: I worked on this product. Having said that, it works great.

Muhimbi