I'm looking for something in Java to read in Word documents and process their text. All I need is their text, nothing fancy. I know about Apache POI, but it doesn't include support for DOCX right now. Is there anything else out there?

+2  A: 

With some googling I found OpenXML4J, which might solve your issue. I have not used it myself, so I am sure someone in the community will have better insight.

Note: This is a duplicate question. The other question has the solution plus a bit of discussion. Link to the question.

tathamr
Is it reasonable to keep both questions, given that one is asking about Word doc format and the other Excel? They may be two subsets of one larger document format spec, I honestly don't know.
Bill the Lizard
I believe it is a duplicate because each question is asking about an Office 2007 Java API. The other question, IMHO, does answer the mail. :)
tathamr
+1  A: 

If you don't require formatting information, images and all the other fancy stuff, then the job is a lot easier. A few lines of code will do.

  1. Treat the DOCX as a zip file. It contains a bunch of files, including 'document.xml' (stored as word/document.xml inside the archive). Use ZipInputStream and extract that file alone. (You can open the docx with your favorite zip utility and see for yourself!)
  2. Use a SAX parser and read the character content of the body/p/r/t elements (w:body, w:p, w:r, w:t in the file). Voila, you've got the text!

This approach only applies if the text is all you need.
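
The two steps above can be sketched roughly as follows. This is a minimal, untuned example rather than a definitive implementation: the class name `DocxTextExtractor` and the method `extractText` are made up for illustration, it assumes the main part lives at `word/document.xml` in the usual (transitional) WordprocessingML namespace, and it only emits a newline at the end of each paragraph. Tabs, line breaks inside a run, headers, footers and footnotes are ignored.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
public class DocxTextExtractor {

    // Assumption: the document uses the transitional WordprocessingML namespace.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Step 1: walk the zip entries until word/document.xml is found.
    public static String extractText(InputStream docx)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    return parseDocumentXml(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a DOCX file?");
    }

    // Step 2: collect the character data inside w:t elements with SAX.
    private static String parseDocumentXml(InputStream xml)
            throws IOException, SAXException, ParserConfigurationException {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // match on namespace + local name, not the "w:" prefix
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName)) {
                    text.append('\n'); // end of paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    text.append(ch, start, length);
                }
            }
        });
        return text.toString();
    }
}
```

To use it, pass any input stream over the file, for example `DocxTextExtractor.extractText(new java.io.FileInputStream("report.docx"))`, where `report.docx` is a placeholder file name.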

Joseph Kulandai