Hey there

What I need to do is convert Microsoft Word .doc files to PDFs or images. This has to occur in Java.

I have done a fair bit of investigation already. I've tried Davisor Publishor but it doesn't give me the accuracy that I need - for instance text overlapping in the output document.

Adobe has something called LiveCycle. Has anyone tried it? It looks quite massive and a bit of overkill (it's an "integrated server solution"). Sounds expensive.

Saying that, it doesn't have to be free, or cheap. Even if you just know some names, please shout them my way.

Many thanks in advance.

Doug

+4  A: 

There are two sides to this problem.

You can read Word docs using Apache POI.

You can write PDFs using either iText from Lowagie, if you're doing it in Java code, or FOP, if you're using a stylesheet to generate the PDF from XML.

I don't know of one library that does both sides for you automatically.

duffymo
The problem, I guess, will be the images and other visual elements.
OscarRyz
Thanks for your reply. Oscar has highlighted one of my issues: would the method above let me reproduce the DOC very accurately as a PDF or image? Is it a straightforward read-in, write-out, or would there be manipulation? The thing is, I don't have any control over the docs fed into the system.
mieze
This is manipulation AFAIK. For in/out see mike's or my own sample below.
OscarRyz
+1  A: 

In addition to duffymo's response:

You could also look at XSL-FO/FOP to generate PDFs.

Here are some links to start you off:

http://poi.apache.org/

http://www.lowagie.com/iText/

http://xmlgraphics.apache.org/fop/

LiveCycle, like other proprietary products, comes with a lot of vendor Kool-Aid, is very frustrating to integrate with your solution, and is probably overkill.

If your needs are more imaging-based and you have some flexibility to work outside of Java, I'd look at some scanning software: scan everything into TIFF, use tiff2ps to bulk convert everything to PostScript, and then use ps2pdf to convert everything to PDFs.

http://linux.about.com/library/cmd/blcmdl1_tiff2ps.htm

http://www.linux.com/articles/35022

Deep Kapadia
+12  A: 

If you can relax the pure Java requirement, then I recommend JODConverter. It is a Java library that handles all of the interprocess communication with OpenOffice. All it requires is that you have OpenOffice running in server mode on one of the ports. I do not think there are any more faithful converters than the OpenOffice ones, so in that respect it may be the best solution.

I appreciate this is not pure Java, but OpenOffice is available as an installable package on many platforms. If you need it in a rich client program, then perhaps consider setting up a servlet to do the conversions for you.
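
As a rough sketch of the "running in server mode on one of the ports" prerequisite: before handing a document to the converter, a plain TCP probe can confirm that something is accepting connections on the chosen port. The class name, the method name, and port 8100 below are illustrative assumptions, not part of JODConverter, and the probe cannot tell whether the listener really is OpenOffice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class OfficePortCheck {

    /** Returns true if something accepts a TCP connection on host:port within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable: treat as "not running".
            return false;
        }
    }

    public static void main(String[] args) {
        // 8100 is only an assumed port; use whichever port the office process was started on.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8100;
        System.out.println("Listening on port " + port + ": " + isListening("127.0.0.1", port, 1000));
    }
}
```

A check like this is only useful for failing fast with a clear message; the conversion call itself still needs its own error handling.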

In my research I could not find any free alternatives for doing conversions from Word documents in pure Java. There are libraries for reading .doc (POI) and writing PDF (quite a few), but that is not the same thing at all: you would be writing your own converter.

I think there are paid-for libraries in pure Java, but they may not give you very good conversions.

mike g
Using this approach the generated PDF is VERY acceptable.
OscarRyz
+2  A: 

I don't remember whether this uses the JODConverter that mike g mentioned; I guess it does, since it sounds the same as something I did in the past.

Here is a sample of how I use it.

// 'connection' must already be open to the running OpenOffice instance (setup not shown)
DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
converter.convert(new File("C:\\oreyes\\hola.doc"),
                  new File("C:\\oreyes\\hola.pdf"));

It is slow, but does the job.

Unfortunately, it does not work for the new Office formats such as docx and xlsx.

It relies, as mike said, on OpenOffice.

OscarRyz
A: 

Many thanks for all of your answers. I have investigated the solutions put forward.

Unfortunately the POI > iText / FOP will not give me a copy of the document in a way that is useful to me - i.e. an exact replica, but in PDF/image form.

JODConverter uses OpenOffice. Sadly, the PDF conversion that OO performs isn't perfect for all cases (missing bullet points, overlapping text in some cases). Adobe and PDFCreator both did a better, satisfactory job in the PDF conversion testing I've performed.

The scanning software solution won't work for me either.

So.

What I'm looking into is using PDFCreator installed in server mode, with an auto-save option enabled that saves the printed file automatically (yes, I am on a Windows server; sorry for not clarifying that earlier). I would use a Java Print API to send to this, and use Java to read the file back in. I'm not certain if this is a crazy idea or not. Please let me know what you think.

Thanks.

mieze
+5  A: 

OK here's what I do.

Use jCOM (a Java-to-COM bridge that lets you make COM calls on the system) to open Word invisibly and print the document to the default printer. The default printer is PDFCreator. PDFCreator is set up to autosave into a known directory. I then use jNotify to watch the directory for the PDF to finish being converted. I can then read it in (and convert it to TIFF with Qoppa's jPDFImages if I wish; not free).
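
For anyone wanting to try the "watch the autosave directory" step without jNotify, here is a minimal sketch that swaps in the JDK's own java.nio.file.WatchService. The class and method names are made up for illustration, and the "size stopped changing" check is only a heuristic assumption for "finished being converted", not something PDFCreator guarantees.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.TimeUnit;

final class PdfOutputWatcher {

    /** Blocks until a new .pdf appears in dir; returns null if none shows up within timeoutMillis. */
    static Path awaitPdf(Path dir, long timeoutMillis) throws IOException, InterruptedException {
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (true) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) return null;
                WatchKey key = watcher.poll(remaining, TimeUnit.MILLISECONDS);
                if (key == null) return null; // timed out
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) continue;
                    Path created = dir.resolve((Path) event.context());
                    if (created.getFileName().toString().toLowerCase().endsWith(".pdf")) {
                        awaitStableSize(created, deadline);
                        return created;
                    }
                }
                if (!key.reset()) return null; // directory is no longer accessible
            }
        }
    }

    /** Heuristic: returns once two consecutive size checks agree, or when the deadline passes. */
    static void awaitStableSize(Path file, long deadline) throws IOException, InterruptedException {
        long previous = -1;
        while (System.currentTimeMillis() < deadline) {
            long current = Files.size(file);
            if (current == previous) return;
            previous = current;
            Thread.sleep(200);
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("New PDF: " + awaitPdf(dir, 60_000));
    }
}
```

Note that the watcher only sees files created after it is registered, so it should be set up before the print job is sent.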

We will see how this holds up under pressure... I'm hesitant to rely on Word, but it's the only app which gives us a perfect rendition of the document (due to PS, I suspect).

Hope this helps / gives people ideas. PM me if stackoverflow supports it.

Doug

mieze
A: 

Aspose.Words will do it well. Our .NET version does it really well. Our Java version is about to come out of beta, so check back soon: http://www.aspose.com/categories/java-components/aspose.words-for-java/default.aspx

romeok
A: 

If it has to be an exact copy of the look in Word, then I believe you have to use Word to do the conversion. The Word format is rather complex, and it is nearly impossible for anyone outside MS to read and reproduce complex documents.

What I did some years ago was to use a little VB script to open Word with a given document and print it to the default (PDF) printer. We used a PostScript printer for this. The VB script is called from Java.

It is not a solution that feels good, but we could not find anything else. And it has been working in production for some years now.

We also made sure that only one conversion is running at a time. I can't remember if that was necessary.
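
A minimal sketch of the "called from Java, one conversion at a time" part, assuming the script is launched as an external process. The class name and the script name mentioned afterwards are hypothetical; this is not the original production code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class SerializedScriptRunner {
    // Guards the external script so that only one conversion runs at a time.
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> commandPrefix;

    SerializedScriptRunner(String... commandPrefix) {
        this.commandPrefix = Arrays.asList(commandPrefix.clone());
    }

    /** Full command line for one document: the prefix followed by the document path. */
    List<String> commandFor(String documentPath) {
        List<String> command = new ArrayList<>(commandPrefix);
        command.add(documentPath);
        return command;
    }

    /** Runs the script for one document while holding the lock; returns its exit code, or -1 on timeout. */
    int convert(String documentPath, long timeoutSeconds) throws IOException, InterruptedException {
        lock.lock();
        try {
            Process process = new ProcessBuilder(commandFor(documentPath)).inheritIO().start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // give up on a hung conversion
                return -1;
            }
            return process.exitValue();
        } finally {
            lock.unlock(); // released even if the process could not be started
        }
    }
}
```

On Windows this might be constructed as new SerializedScriptRunner("cscript", "//NoLogo", "print-to-pdf.vbs"), where print-to-pdf.vbs stands in for whatever script opens Word and prints. The lock only serializes conversions inside a single JVM.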

bert
A: 

Although some of the answers to this question provide a more 'native' Java based approach, the quality of the converted documents will be lacking. Depending on your needs this may not be a problem though.

If conversion fidelity is important to you, then have a look at the Muhimbi PDF Converter Web Services. It runs on Windows as a service, but can be accessed from any environment capable of calling web services, Windows or not, including Java and .NET.

Disclaimer, I worked on this product. Having said that, it works great.

Muhimbi
+1  A: 

Perhaps too late, but worth posting in case others are looking these days - Docmosis can help. Accuracy is based on OpenOffice's impressive import/export engines. Performance is provided by Docmosis plus content manipulation if you require it.

jowierun
A: 

Qoppa Software, a company that specializes in PDF software, is currently developing new functionality to convert Word documents to PDF in pure Java with no other software needed. They are currently in beta testing.

Their website is at:

http://www.qoppa.com

Joe Holmes