tags:

views:

1553

answers:

6

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a a feature for a web app deployed on a non-Windows platform - that's non-negotiable in this case.)

Antiword seems like it might be a reasonable option, but it seems like it might be abandoned.

A Python solution would be ideal, but doesn't appear to be available.

+3  A: 

Open Office has an API

Unsliced
A: 

If it is Word 2007 docx, you could unzip it and parse the XML files that are contained inside.

Guy Starbuck
+2  A: 

Using the OpenOffice API, and Python, and Andrew Pitonyak's excellent online macro book I managed to do this. Section 7.16.4 is the place to start.

One other tip to make it work without needing the screen at all is to use the Hidden property:

RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL( docpath,"_blank", 0, (RO, Hidden,) )

Otherwise the document flicks up on the screen (probably on the webserver console) when you open it.

paulmorriss
+4  A: 

I use catdoc or antiword for this, whatever gives the result that is the easiest to parse. I have embedded this in python functions, so it is easy to use from the parsing system (which is written in python).

import os

def doc_to_text_catdoc(filename):
    (fi, fo, fe) = os.popen3('catdoc -w "%s"' % filename)
    fi.close()
    retval = fo.read()
    erroroutput = fe.read()
    fo.close()
    fe.close()
    if not erroroutput:
        return retval
    else:
        raise OSError("Executing the command caused an error: %s" % erroroutput)

# similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, BTW.

codeape
+1  A: 

For docx files, check out the Python script docx2txt available at

http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt

for extracting the plain text from a docx document.

+2  A: 

(Same answer as http://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python)

Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:

document = opendocx('Hello world.docx')

# This location is where most document content lives 
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text
print getdocumenttext(document)

See http://github.com/mikemaccana/python-docx

100% Python, no COM, no .net, no Java, no parsing serialized XML with regexs, no crap.

nailer