ansaurus

Question

How can I read a Word 2007 .docx file?

Answer 1

+2 A:

A .docx is just a zip archive with lots of files inside, so you may be able to search the contents of those files directly. Otherwise you will probably have to find a library that understands the Word format, so that you can filter out the things you're not interested in.

A second option would be to use interop with Word and do the search through it.

kokos 2008-09-22 17:16:43

Answer 2

+1 A:

A docx file is essentially a zip file with XML inside it.
The XML contains the formatting, but it also contains the text.

shoosh 2008-09-22 17:17:15

Answer 3

A:

You should be able to use the MSWord ActiveX interface to extract the text to search (or, possibly, do the search). I have no idea how you access ActiveX from Python though.

Andy Brice 2008-09-22 17:17:26

Answer 4

+19 A:

More exactly, a .docx document is a Zip archive in OpenXML format: you first have to uncompress it.
I downloaded a sample (Google: some search term filetype:docx) and after unzipping I found some folders. The word folder contains the document itself, in file document.xml.
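If you would rather look inside the archive from Python than unzip it by hand, something like the following sketch should work. It only uses the standard zipfile module. The helper name list_docx_parts is made up for this example, and the tiny in-memory archive is a stand-in for a real file, so a document saved by Word will list many more parts.

```python
import io
import zipfile


def list_docx_parts(source):
    """Return the names of the files stored inside a .docx archive.

    `source` can be a path or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


# Stand-in for a real document: a minimal in-memory archive that mimics
# the layout described above. This is an assumption for illustration only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<w:document/>")
buffer.seek(0)

parts = list_docx_parts(buffer)
print(parts)
```

With a real file you would pass the path instead, e.g. list_docx_parts("sample.docx"), and look for word/document.xml in the result.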

PhiLho 2008-09-22 17:22:10

Answer 5

+1 A:

OLE Automation would probably be the easiest. You have to consider formatting, because the text could look like this in the XML:

Looking for this phrase

There's no easy way to find that using a simple text scan.
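A small sketch of the problem, in case it helps. The XML fragment below is invented for illustration (it is not copied from a real document), with one word in italics so that the phrase is spread over three runs. The regex that drops tags is a crude workaround rather than a proper parser, and it assumes no attribute value contains a ">" character.

```python
import re

# Hypothetical fragment of word/document.xml: the phrase is split into
# three runs because the word "this" is italic.
fragment = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t xml:space="preserve">Looking for </w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>this</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> phrase</w:t></w:r>'
    '</w:p>'
)

phrase = "Looking for this phrase"

# A plain scan of the raw XML misses the phrase, because tags sit
# between the words.
found_raw = phrase in fragment

# Crude workaround: remove everything that looks like a tag, then scan.
text_only = re.sub(r"<[^>]+>", "", fragment)
found_stripped = phrase in text_only

print(found_raw, found_stripped)
```

Note that stripping tags this way also glues neighbouring paragraphs together, so it can produce false matches across paragraph boundaries.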

ilitirit 2008-09-22 23:03:32

Answer 6

+6 A:

In this example, "Course Outline.docx" is a Word 2007 document, which does contain the word "Windows", and does not contain the phrase "random other string".

>>> import zipfile
>>> z = zipfile.ZipFile("Course Outline.docx")
>>> xml = z.read("word/document.xml").decode("utf-8")  # read() returns bytes on Python 3
>>> "Windows" in xml
True
>>> "random other string" in xml
False
>>> z.close()

Basically, you just open the docx file (which is a zip archive) using zipfile, and find the content in the 'document.xml' file in the 'word' folder. If you wanted to be more sophisticated, you could then parse the XML, but if you're just looking for a phrase (which you know won't be a tag), then you can just look in the XML for the string.

Tony Meyer 2008-09-22 23:12:59

It's probably easier to look for the phrase in the element text (using an XML parser) than having to worry about whether part of your text is matched by an element name.

nailer 2009-12-27 12:59:51

Answer 7

A:

Here is an example of using XLINQ to search through a Word document.

rudigrobler 2008-09-26 08:03:35

Answer 8

A:

You may also consider using the library from OpenXMLDeveloper.org

billb 2008-10-18 19:34:32

Answer 9

+5 A:

A problem with searching inside a Word document XML file is that the text can be split into elements at any character. It will certainly be split wherever the formatting changes, for example in "Hello World." with "Hello" in bold. But it can be split at any point, and that is valid in OOXML, so you can end up dealing with XML like this even if the formatting does not change in the middle of the phrase:

<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
  <w:r w:rsidRPr="003F6D7A">
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>Hello</w:t>
  </w:r>
  <w:r>
    <w:t xml:space="preserve">World.</w:t>
  </w:r>
</w:p>

You can of course load it into an XML DOM tree (not sure what this will be in Python) and ask to get text only as a string, but you could end up with many other "dead ends" just because the OOXML spec is around 6000 pages long and MS Word can write lots of "stuff" you don't expect. So you could end up writing your own document processing library.
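For what it's worth, in Python the "load it and get the text only" step could look roughly like the sketch below, using the standard library's zipfile and xml.etree.ElementTree. The function names paragraph_texts and docx_contains are made up for this example, and the in-memory sample document is an assumption: it follows the shape of the fragment above, with a leading space added in the second run so the joined text reads naturally.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p, w:r and w:t.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts(source):
    """Yield the text of each paragraph in word/document.xml, runs joined."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for paragraph in root.iter(W_NS + "p"):
        # Concatenate every w:t below this paragraph, ignoring formatting.
        yield "".join(t.text or "" for t in paragraph.iter(W_NS + "t"))


def docx_contains(source, phrase):
    """Return True if `phrase` occurs within a single paragraph."""
    return any(phrase in text for text in paragraph_texts(source))


# Hypothetical sample document, built in memory for illustration only.
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> World.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)

print(docx_contains(buffer, "Hello World."))
```

This only looks at the main document body. Headers, footers, footnotes and the like live in other parts of the archive, and tabs or line breaks inside a paragraph are separate elements, which is exactly where the dead ends start.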

Or you can try using Aspose.Words.

It is available as .NET and Java products, and both can be used from Python: the .NET one via COM Interop, the Java one via JPype. See "Utilize Aspose.Words in Other Programming Languages" in the Aspose.Words Programmers Guide (sorry I can't post a second link, stackoverflow does not let me yet).

romeok 2009-11-15 11:01:08

Answer 10

+6 A:

Hi Gerry,

After reading your post above, I made a 100% native Python docx module to solve this specific problem.

# Import the functions used below from the module
from docx import opendocx, search

# Open the .docx file
document = opendocx('A document.docx')

# search() returns True if the string is found
search(document, 'your search string')

The docx module is at http://github.com/mikemaccana/python-docx

nailer 2009-12-30 12:08:33

That link is a 404.

Tony Meyer 2010-01-02 23:42:41

Thanks, fixed it.

nailer 2010-01-03 11:05:45

Answer 11

A:

I like Mike MacCana's docx module, and have found it extremely useful. There's one thing that I would like to be able to do, however, that I haven't been able to figure out: I'd like to be able to determine whether any given paragraph is ordinary text or a section heading, and if it is a section heading, the level. Does anyone know of a way of doing this?

Phillip

Dr. Phillip M. Feldman 2010-07-02 18:45:24

This isn't an answer, it's a question! Create your own question, citing and linking to nailer's answer; then you have the best chance of getting an answer.

Charles Stewart 2010-07-03 12:41:46
