ansaurus

Question

extracting text from MS word files in python

Answer 1

+3 A:

Take a look at "how the doc format works" and "create Word document using PHP in Linux". The former is especially useful. AbiWord is my recommended tool. There are limitations, though:

However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

Swati 2008-09-24 03:17:26

Not just that, though! Even the most basic text saved in the Word 97 format is nearly impossible to get at easily without relying on Word to do it for you (COM). Most Word documents are not HTML!

William Keller 2008-09-24 03:30:30

AbiWord doesn't assume that it's an HTML document, and considering how extensive the tool is... I don't think it was "easy" to implement. AbiWord is a tool that helps you read MS Word files, and since the author is concerned with text retrieval, this suffices.

Swati 2008-09-24 03:42:19

Ah, I'd always thought that abiword was just another word processor! Man, that would have saved me some headaches awhile back.

William Keller 2008-09-24 12:11:05

Answer 2

+1 A:

I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!

@Swati, that's in HTML, which is fine and dandy, but most Word documents aren't so nice!

William Keller 2008-09-24 03:19:53

Added limitations :)

Swati 2008-09-24 03:22:37

Answer 3

+7 A:

OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.

Dan 2008-09-24 03:23:42

Not flawlessly. Close, but far from flawless in my experience (OO 2.0 - 3.0).

SpliFF 2009-05-26 15:17:30

As flawlessly as MS Word N+1 opens MS Word N files, and way better than MS Word N+1 opens MS Word N-1 files, IMHO.

voyager 2009-09-29 14:50:55

Answer 4

+6 A:

You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping text out of a Word .doc file. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
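
A minimal sketch of that subprocess call, assuming antiword is installed and on the PATH. The function names and the UTF-8 decoding are my own assumptions for illustration; antiword's actual output encoding depends on how it is configured.

```python
import subprocess


def antiword_command(doc_path, executable="antiword"):
    """Build the argument list; passing a list avoids shell quoting problems."""
    return [executable, doc_path]


def extract_text_with_antiword(doc_path, executable="antiword"):
    """Return the plain text that antiword writes to stdout for a .doc file."""
    result = subprocess.run(
        antiword_command(doc_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if antiword exits with an error
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text_with_antiword('report.doc') would then give you the document text as one string, or raise an exception if antiword is missing or cannot read the file.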

John Fouhy 2008-09-24 04:13:03

Answer 5

+2 A:

I know this is an old question, but I was recently trying to find a way to extract text from MS Word files, and the best solution I found by far was wvLib:

http://wvware.sourceforge.net/

After installing the library, using it in Python is pretty easy:

import commands

exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)

And that's it. Pretty much, what we're doing is using the commands.getoutput function to run a couple of shell commands, namely wvText (which extracts text from a Word document) and cat (to read the output file). After that, the entire text from the Word document will be in the out variable, ready to use.
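
The commands module only exists in Python 2. On Python 3, a rough equivalent could look like the sketch below. It assumes wvText is on the PATH and takes the input and output paths as its two arguments, as in the snippet above; the function names and the UTF-8 decoding are assumptions of mine.

```python
import os
import subprocess
import tempfile


def wvtext_command(word_file, output_txt_file, executable="wvText"):
    """Arguments are passed as a list, so spaces in paths need no quoting."""
    return [executable, word_file, output_txt_file]


def extract_text_with_wvtext(word_file, executable="wvText"):
    """Run wvText into a temporary file and return that file's contents."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_txt_file = os.path.join(tmp_dir, "output.txt")
        subprocess.run(
            wvtext_command(word_file, output_txt_file, executable),
            check=True,  # raise CalledProcessError if wvText fails
        )
        # Read the result directly instead of shelling out to cat.
        # Assumption: the text is UTF-8; undecodable bytes are replaced.
        with open(output_txt_file, encoding="utf-8", errors="replace") as handle:
            return handle.read()
```

Building the command as a list rather than a concatenated string also means a file name containing spaces or shell metacharacters cannot break the call.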

Hopefully this will help anyone having similar issues in the future.

Dave 2009-01-01 01:14:38

Answer 6

+1 A:

(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

So that's:

unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]*>//g': Remove everything inside tags

grep -v '^[[:space:]]*$': Remove blank lines

There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.

As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)

Ben Williams 2009-08-11 05:38:51

Answer 7

+1 A:

If your intention is to use purely Python modules without calling a subprocess, you can use the zipfile Python module.

import zipfile

content = ""
# Load the DocX into a zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# List the members of the archive
unpacked = docx.infolist()
# Find the word/document.xml file in the package and read it into a string
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        # read() returns bytes; decode so the string operations work on text
        content = docx.read(item.orig_filename).decode('utf-8')

Your content string, however, still contains XML tags and needs to be cleaned up. One way of doing this is:

# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        # Everything before '>' is tag markup; the text follows it
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])

# Assemble a new string with all pure content
content = " ".join(fullyclean)

But there is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.

benjamin 2009-11-12 16:18:16

Answer 8

A:

benjamin's answer is a pretty good one. I have just consolidated...

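import re
import zipfile

docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
# document.xml holds the main body; decode the raw bytes before matching
content = docx.read('word/document.xml').decode('utf-8')
# Strip every XML tag, including tags that span line breaks
cleaned = re.sub(r'<(.|\n)*?>', '', content)
print(cleaned)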

Chad 2009-12-28 03:39:54

I should reiterate that this only works for .docx (Word 2007 or later). For .doc files wvware is your best bet. Depending on your environment it can be a pain to set up, but it does do a very nice job.

Chad 2009-12-28 03:41:58

Answer 9

+1 A:

Use the native Python docx module. Here's how to extract all the text from a doc:

document = opendocx('Hello world.docx')

# This location is where most document content lives 
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text
print getdocumenttext(document)

See http://github.com/mikemaccana/python-docx

Parsing XML with regexes invokes Cthulhu. Don't do it!
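
If installing a third-party module is not an option, the same no-regex idea can be sketched with the standard library's XML parser. This is only an illustration: the function name docx_paragraphs is made up, and it assumes the body text lives in w:t elements grouped under w:p paragraphs in word/document.xml. Documents with text boxes nested inside paragraphs may need extra handling.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_path):
    """Return the text of each non-empty paragraph (w:p) in a .docx file."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # Join the text runs (w:t) that make up this paragraph
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Something like "\n".join(docx_paragraphs('Hello world.docx')) then gives the whole text with one paragraph per line, and the parser takes care of entities such as &amp;amp; that a tag-stripping regex would leave in place.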

nailer 2009-12-30 12:17:09

Answer 10

A:

You can load DOC, DOCX, RTF, WordML or HTML into Aspose.Words and then just use Document.GetText().

Aspose.Words is available as a library for .NET and for Java. You can use either one from Python, but the way you invoke it differs for each.

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net-and-java/utilize-aspose-words-in-other-programming-languages.html

romeok 2010-01-21 01:23:13
