I'm looking for something in Java that can read Word documents so I can process their text. All I need is the text, nothing fancy. I know about Apache POI, but it doesn't include support for DOCX right now. Is there anything else out there?
+2
A:
With some googling I found OpenXML4J, which might solve your issue. I have not used it before, but I am sure someone in the community will have better insight.
Note: This is a duplicate question. This has the solution plus a bit of discussion. Link to the question.
tathamr
2010-02-15 04:51:31
Is it reasonable to keep both questions, given that one is asking about Word doc format and the other Excel? They may be two subsets of one larger document format spec, I honestly don't know.
Bill the Lizard
2010-02-15 05:40:48
I believe it is a duplicate because both questions are asking about an Office 2007 Java API. The other question, IMHO, does answer the mail. :)
tathamr
2010-02-15 13:57:34
+1
A:
If you don't require formatting information, images, and all the other fancy stuff, then the job is a lot easier. Just some 5 to 10 lines of code will do.
- Treat the DOCX as a zip file. It consists of a bunch of files, one of which is 'document.xml' (stored as word/document.xml inside the archive). Use ZipInputStream and extract that file alone. (You can open the docx with your favorite zip utility and see for yourself!)
- Use a SAX parser and read the contents of the body/p/r/t nodes. Voila, you've got the text!
This applies only if the text is all you need.
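The two steps above can be sketched roughly as follows, using only the JDK. This is a minimal sketch, not a tested production reader: the class name DocxText and the method extractText are made up for illustration, and it assumes the main text lives in the word/document.xml entry and that text runs are t elements in the WordprocessingML namespace. It collects every such t element anywhere in the document (so table cells are included) and adds a newline after each p; tabs, line breaks, headers, and footnotes are ignored.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DocxText {
    // Namespace of the w: prefix used in word/document.xml.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Collects the character data of every w:t element; one line per w:p. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private boolean inText = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                inText = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(localName)) {
                inText = false;
            } else if ("p".equals(localName)) {
                text.append('\n'); // end of paragraph
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                text.append(ch, start, length);
            }
        }
    }

    /** Scans the zip for word/document.xml and returns its plain text. */
    public static String extractText(InputStream docx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // match on namespace URI, not on the "w" prefix
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    TextHandler handler = new TextHandler();
                    factory.newSAXParser().parse(zip, handler);
                    return handler.text.toString();
                }
            }
        }
        throw new IOException("word/document.xml not found in archive");
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxText some-file.docx
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

Matching on the namespace URI rather than the literal "w:t" name is a design choice here: the prefix is only a convention, so the URI check is a bit more robust.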
Joseph Kulandai
2010-03-01 17:04:54
A:
You could try docx4j; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java
plutext
2010-08-31 03:10:25