How can I scan a bunch of Microsoft Word (2003) documents? I am searching for a certain phrase in the documents and want to return the file names of those which contain the phrase.

A code sample would be especially helpful.

A: 

For Word 2007 documents, it's fairly easy using the Open XML SDK 2.0. These files are basically zip archives that contain a set of XML files holding the document's contents. All you'd have to do is scan the XML to find what you need.

In fact, if you had a Word 2007 document named testDoc.docx, you could just add .zip to the end of its name and peek inside. The file would then be named testDoc.docx.zip.

The main XML file is called document.xml (inside the word folder of the archive).
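For illustration, here is a minimal sketch of that idea in Python using only the standard library instead of the Open XML SDK. The function names `docx_text` and `docx_contains_phrase` are made up for this example, and it assumes the main part sits at its usual location, `word/document.xml`. Text is joined per paragraph because Word may split a single phrase across several runs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(path):
    """Return the text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main document part is at its default location.
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs so a phrase split across runs still matches.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)

def docx_contains_phrase(path, phrase):
    """Case-insensitive check for a phrase inside one paragraph or line."""
    return phrase.lower() in docx_text(path).lower()
```

Calling `docx_contains_phrase("testDoc.docx", "some phrase")` would then answer the question for one file. This does not help with Word 2003 .doc files, which are not zip archives.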

irperez
Unfortunately, these are 2003 documents. Thanks.
Josh Stodola
A: 

With Word 2003 files, you can search the binary file directly; most of the text is stored intact in there.
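A rough sketch of such a raw search in Python follows. It assumes the text is stored either as 8-bit characters (tried here as cp1252) or as UTF-16LE, and the function name `doc_bytes_contain` is made up for this example. Treat a miss with caution: text that the file format happens to store in non-contiguous pieces will not be found this way.

```python
def doc_bytes_contain(path, phrase):
    """Search the raw bytes of a file for a phrase in two likely encodings."""
    with open(path, "rb") as handle:
        # bytes.lower() only changes ASCII A-Z, so the match is
        # case-insensitive for ASCII letters only.
        data = handle.read().lower()
    for encoding in ("cp1252", "utf-16-le"):
        try:
            needle = phrase.encode(encoding).lower()
        except UnicodeEncodeError:
            continue  # the phrase cannot be represented in this encoding
        if needle in data:
            return True
    return False
```

Because it never starts Word, this is fast enough to use as a first pass over many files before checking the hits with a more reliable method.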

ck
A: 

If you have access to antiword and grep, it should be as easy as

for file in *.doc
do
    # -q: no output, -i: ignore case, -F: treat the phrase as a fixed string
    if antiword "$file" | grep -qiF "phrase to find"; then
        echo "$file"
    fi
done
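If the scan has to be driven from a program rather than a shell, the same idea can be sketched in Python. This assumes `antiword` is on the PATH and prints the document text to standard output; the names `antiword_text` and `files_containing` are made up for this example, and decoding the output as UTF-8 is a guess that depends on how antiword is configured.

```python
import subprocess

def antiword_text(path):
    """Return the text antiword extracts from one .doc file."""
    result = subprocess.run(["antiword", path], capture_output=True, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")

def files_containing(paths, phrase, extract_text=antiword_text):
    """Return the paths whose extracted text contains the phrase."""
    matches = []
    for path in paths:
        try:
            text = extract_text(path)
        except subprocess.CalledProcessError:
            continue  # antiword could not read this file, so skip it
        if phrase.lower() in text.lower():
            matches.append(path)
    return matches
```

For example, `files_containing(glob.glob("*.doc"), "phrase to find")` would list the matching files in the current directory (after `import glob`).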
Steen
A: 

You could use the Word Object Model.

crashmstr
A: 

Is the Word Interop service an option for you? If Word is installed on the server, you could simply open the Word documents and use the Find feature of Word itself to locate your text. Of course, this is a resource-intensive method, and there are a lot of things to be careful of when using Interop services, but it would get the job done.

Your biggest issue may be that an appropriate version of Word needs to be installed on the server.
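As a sketch of what using Word's own Find could look like, here is a Python version driven through COM (the .NET Interop calls should be similar). It assumes Windows, an installed copy of Word, and the pywin32 package; the function names `word_find` and `scan` are made up for this example. Word is started once and reused, since launching it is the expensive part.

```python
def word_find(app, filename, phrase):
    """Use Word's Find on a document opened in a running Word.Application."""
    # Open(FileName, ConfirmConversions, ReadOnly, AddToRecentFiles)
    doc = app.Documents.Open(filename, False, True, False)
    try:
        # Execute(FindText, MatchCase) returns True when the text is found.
        return bool(doc.Content.Find.Execute(phrase, False))
    finally:
        doc.Close(False)  # False: close without saving changes

def scan(filenames, phrase):
    """Return the file names in which Word finds the phrase."""
    import win32com.client  # pywin32; needs Windows with Word installed
    app = win32com.client.Dispatch("Word.Application")
    try:
        return [name for name in filenames if word_find(app, name, phrase)]
    finally:
        app.Quit()  # always shut Word down, even after an error
```

Pass full paths to `scan`, because Word resolves relative names against its own working directory rather than the script's.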

Cory Larson
Who said anything about a server...? I do have Word installed on the machine that this is going to run on. I certainly don't want to manually search 1,000 files for a piece of text! I need to do this programmatically with .NET.
Josh Stodola
Funny that the highest voted answer (allclaws') has to create an instance of Word, the same way Interop services does, yet I get voted down.
Cory Larson
Also, see Patrick McDonald's comment -- that is exactly what I was suggesting.
Cory Larson
A: 

I don't have a code sample at hand, but have you looked at the Windows Indexing Service with the Office plugin installed?

Emil C
+2  A: 

You could do it with COM. However, if you are scanning a lot of files this might be painfully slow since you will be interacting with the text through Word itself.

Here is some Python code using win32com (sorry, I don't know much .NET, but the COM functions will be similar).

I'm guessing you might have to trim up the whitespace a bit to get good matches.

import os
import win32com.client

def doc_has_phrase(app, filename, phrase):
    # Open(FileName, ConfirmConversions, ReadOnly, AddToRecentFiles)
    doc = app.Documents.Open(filename, False, True, False)
    try:
        return phrase.lower() in doc.Content.Text.lower()
    finally:
        doc.Close(False)  # close without saving changes

phrase = 'key phrase in lowercase'
valid_types = ('.doc',)
path = "C:\\Path\\To\\Files\\"

# Start Word once and reuse it; launching it for every file is very slow.
app = win32com.client.Dispatch('Word.Application')
try:
    for name in os.listdir(path):
        if name.lower().endswith(valid_types):
            print(doc_has_phrase(app, path + name, phrase), path + name)
finally:
    app.Quit()  # make sure Word does not stay running after an error
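On the whitespace point above, a small normalizing helper might look like the sketch below. It assumes that the text returned by Word uses carriage returns for paragraph marks and a `\x07` character at table cell ends; the names `normalize` and `text_has_phrase` are made up for this example.

```python
def normalize(text):
    """Collapse whitespace runs to single spaces and lowercase the text."""
    # Assumption: Word marks table cell ends with \x07; treat it as a space.
    text = text.replace("\x07", " ")
    # split() with no arguments splits on any run of whitespace,
    # including \r, \x0b, tabs and non-breaking spaces.
    return " ".join(text.split()).lower()

def text_has_phrase(text, phrase):
    """Whitespace- and case-insensitive phrase check."""
    return normalize(phrase) in normalize(text)
```

With this, a phrase that wraps over a paragraph or line break in the document still matches, at the cost of also matching across boundaries you might have wanted to respect.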
allclaws
+2  A: 

A VB.NET version of allclaws' method:

Function ContainsText(ByVal fileName As String, ByVal text As String) As Boolean

    Dim app As New Microsoft.Office.Interop.Word.Application

    Try
        ' Open inside the Try block so Word is shut down even if the file fails to open.
        Dim doc As Microsoft.Office.Interop.Word.Document
        doc = app.Documents.Open(DirectCast(fileName, Object))
        Return doc.Content.Text.IndexOf(text, StringComparison.CurrentCultureIgnoreCase) <> -1
    Finally
        app.Quit()
    End Try

End Function

Edit:

To make this work, you need to add a reference to the Microsoft Word 11.0 Object Library, available on the COM tab of the Add Reference dialog.

Patrick McDonald
THANKS!! What reference do I need to add to make this work? If I don't have the DLL, where can I find it?
Josh Stodola
@Josh, the library is part of Word, so if you have Word installed on your machine then you'll have it.
Patrick McDonald