I'm wondering how you can convert Word .doc/.docx files to text files through Java. I understand that there's an option where I can do this through Word itself but I would like to be able to do something like this:

java DocConvert somedocfile.doc converted.txt

Thanks.

+1  A: 

You should consider using the Apache POI library.

Excerpt from the website

In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.

Bragboy
+6  A: 

If you're interested in a Java library that deals with Word document files, you might want to look at Apache POI. A quote from the website:

Why should I use Apache POI?

A major use of the Apache POI api is for Text Extraction applications such as web spiders, index builders, and content management systems.


P.S.: If, on the other hand, you're simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.


Edit: If you don't want to use an existing library but do all the hard work yourself, you'll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you're interested in. In your case, you'd need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
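To give a rough idea of what the do-it-yourself route looks like for the Open XML case only: a .docx file is a ZIP archive whose main body normally lives in the entry word/document.xml, with text held in w:t elements inside w:p paragraphs. The sketch below uses only the JDK (Java 9 or later). The class name DocxToText and its method names are made up for illustration. It ignores headers, footers, footnotes, tabs and line breaks, and it may repeat text when paragraphs are nested (for example inside text boxes). The binary .doc format is far more involved and is not handled here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls plain text out of a .docx using only the JDK.
class DocxToText {
    // Namespace of the WordprocessingML elements (w:p, w:t, ...).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Finds word/document.xml inside the ZIP container and extracts its text.
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // Reads up to the end of the current entry (Java 9+).
                    return textOf(zip.readAllBytes());
                }
            }
        }
        throw new IOException("word/document.xml not found; not a .docx file?");
    }

    // Concatenates the w:t text of each w:p paragraph, one paragraph per line.
    static String textOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations to avoid XML external entity tricks.
        factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    // Usage: java DocxToText input.docx output.txt
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java DocxToText input.docx output.txt");
            System.exit(1);
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String text = extractText(in);
            Files.write(Paths.get(args[1]),
                    text.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Even this small case shows why a library is usually the better choice: everything beyond plain body paragraphs (and the whole of the older binary format) is extra work you would have to take on yourself.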

stakx
Oh sorry, I would like to build the utility I'm talking about.
Coding District
A: 

Docmosis can read a .doc file and extract the text in it. It requires some infrastructure to be installed (such as OpenOffice). You can also use JODConverter.

jowierun