I'm looking for a library (or command line tool) to turn MS Office documents into either plaintext or HTML (for conversion to text).

It must run on Linux (not via Wine!).

I found antiword, but its last release was in 2005, so it won't read the new Office 2007 formats.

I need it to read Word, Excel and PowerPoint documents.

+4  A: 

The new Office 2007 format is just ZIP-compressed XML.

At least for the .docx format, all the text is in the word/document.xml file once you decompress the archive. Strip all the XML tags from it and you'll get the text. You'll lose the formatting, but if you want to do text indexing or something similar, formatting isn't relevant anyway. The order of the text is preserved.
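A minimal sketch of that tag-stripping approach in Python, using only the standard library. The function name `docx_to_text` is made up for illustration, and the code assumes the file uses the `w:` namespace prefix that Word normally writes, so that `</w:p>` marks the end of a paragraph. It decodes only the basic XML entities and makes no attempt to handle tabs, tables, headers or footnotes.

```python
import re
import zipfile
from xml.sax.saxutils import unescape


def docx_to_text(path):
    """Return the text of a .docx file by stripping tags from word/document.xml.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Assumption: paragraphs close with </w:p>; turn each into a newline
    # so that paragraphs do not run together once the tags are gone.
    xml = xml.replace("</w:p>", "\n")
    # Drop every remaining tag (including the XML declaration).
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode &amp;, &lt; and &gt; back into plain characters.
    return unescape(text)
```

For indexing, something like `docx_to_text("report.docx").split()` would then give a rough word list; a real XML parser would be more robust than a regular expression if the documents are not known to be well-behaved.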

I haven't analyzed Excel and PowerPoint, but the approach should be similar. Excel might be trickier, depending on how the cells are stored in the XML file.
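On the Excel point: in .xlsx files, cell text is typically not stored in the sheet XML itself but in a shared table, xl/sharedStrings.xml, which the cells reference by index. A rough sketch for pulling out that table is below. The function name `xlsx_shared_strings` is hypothetical, the code assumes the workbook actually contains a shared-strings part (one with no text cells may not), and it ignores numbers and inline strings, which live in the individual sheet files.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by sharedStrings.xml.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def xlsx_shared_strings(path):
    """Return the list of shared strings stored in an .xlsx file.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
    strings = []
    for item in root.findall("{%s}si" % MAIN_NS):
        # An entry holds either a single <t> or several rich-text runs,
        # each with its own <t>; join them into one string.
        parts = [t.text or "" for t in item.iter("{%s}t" % MAIN_NS)]
        strings.append("".join(parts))
    return strings
```

For plain text indexing, the strings returned here are usually enough; mapping them back to specific cells would mean reading the sheet files as well.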

Vinko Vrsalovic
A: 

PyODConverter automates OpenOffice; use it to do the conversions.
The OONinja example converts DOC to PDF, but any import or export format that OpenOffice supports should work. It also has the advantage of running headless if required.

Other options include AbiWord or, if you really just want a command-line tool, wvWare, though I don't think it supports .docx.

10ToedSloth
+1  A: 

The Apache POI library can extract text from Office formats. It is used by Tika in Lucene. Tika can be run as a command-line tool:

curl http://.../document.doc \
  | java -jar tika-app-x.y.jar --text \
  | grep -q keyword
Thomas Jung
After much vacillation among various solutions (including writing our own based on the ECMA standard, as suggested by Vinko), we're probably going to use the POI libraries.
RickMeasham
A: 

You can use Autonomy Keyview with the appropriate licence to use in your application. It seems to be extremely powerful and can extract text from almost everything; we use it to identify text within arbitrary format files.

I've no idea what the licensing terms are, but they're available from your account manager :)

MarkR