On September 28, 2009, the Apache POI project released version 3.5, which officially supports the OOXML formats introduced in Office 2007, such as DOCX and XLSX.

Please provide a code sample for extracting a DOCX file's content in plain text, ignoring any styles or formatting.

I am asking this because I have been unable to find any Apache POI examples covering the new OOXML support.

A: 

This worked for me. Make sure the required JARs are on the classpath (poi-ooxml and its dependencies, including an upgraded xmlbeans).

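import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public String extractText(InputStream in) throws Exception {
    // Parse the DOCX package, then extract its text without styles or formatting.
    XWPFDocument doc = new XWPFDocument(in);
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    return ex.getText();
}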
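
For anyone curious what "plain text, ignoring styles" means at the file level, here is a rough JDK-only sketch. It is not how POI is implemented and is no substitute for it. It relies on two facts about the format: a DOCX file is a ZIP package whose main body lives in word/document.xml, and the visible text sits in w:t elements inside w:p paragraphs. The class name DocxTextSketch and the sampleDocx helper are made up for this illustration, and the code assumes Java 9 or later (for readAllBytes).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative sketch only (hypothetical class, not part of POI). It ignores
// headers, footers, footnotes, tabs, line breaks and nested text boxes.
final class DocxTextSketch {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a DOCX stream, one line per paragraph. */
    static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes stops at the end of the current ZIP entry.
                    return textOf(zip.readAllBytes());
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a DOCX file?");
    }

    /** Concatenates the w:t text runs of each w:p paragraph, dropping all formatting. */
    private static String textOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot trigger XXE.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    /**
     * Demo helper (hypothetical): builds a minimal in-memory ZIP that holds only
     * word/document.xml. The paragraph text must not contain XML metacharacters.
     */
    static byte[] sampleDocx(String... paragraphs) throws IOException {
        StringBuilder xml = new StringBuilder("<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String p : paragraphs) {
            // Each paragraph is one bold run, to show that formatting is dropped.
            xml.append("<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>").append(p).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = sampleDocx("Hello", "bold world");
        System.out.print(extractText(new ByteArrayInputStream(docx)));
    }
}
```

For real documents the POI extractor above is still the better choice, since it deals with the parts of the format this sketch skips.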
Tanuj Chatterjee
A: 

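This is more generic: ExtractorFactory detects the file type from the stream, so the same code handles DOCX, XLSX, and the older binary formats.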

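import org.apache.poi.POITextExtractor; // in newer POI releases: org.apache.poi.extractor.POITextExtractor
import org.apache.poi.extractor.ExtractorFactory;

// Returns the extractor that matches the detected format.
POITextExtractor poitex = ExtractorFactory.createExtractor(in);
return poitex.getText();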

Tanuj Chatterjee
I agree. Thank you for a good answer covering more generic text extraction. I wish I could accept both.
rcampbell