How can I scan a bunch of Microsoft Word (2003) documents? I am searching for a certain phrase in the documents and want to return the file names of those which contain the phrase.
A code sample would be especially helpful.
For Word 2007 documents, it's fairly easy using the Open XML SDK 2.0. These files are basically zip archives containing a set of XML files that hold the document's contents. All you have to do is scan the XML for what you need using this framework.
In fact, if you had a Word 2007 document named testDoc.docx, you could just add .zip to the end of the name and peek inside. The file would then be named testDoc.docx.zip.
The main XML file is called document.xml; it sits in the word folder inside the archive.
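If you would rather not depend on the Open XML SDK, the same idea can be sketched with nothing but a zip reader and an XML parser. The following is a minimal Python sketch, not a complete solution: the function names docx_text and docx_has_phrase are made up for illustration, and it only looks at the main body in word/document.xml, so text in headers, footers, or footnotes (stored in separate parts of the archive) would be missed.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A phrase can be split across several <w:t> runs, so join them first.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def docx_has_phrase(path, phrase):
    """Case-insensitive check for phrase in the document body."""
    return phrase.lower() in docx_text(path).lower()
```

Joining the runs per paragraph matters because Word often splits a single sentence into several runs when formatting or spell-check state changes partway through.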
For Word 2003, you can search the binary file directly; most of the text is stored intact in it.
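A rough Python sketch of that raw search follows. It rests on an assumption worth stating loudly: that the phrase appears in the file as one contiguous run of bytes, either as 8-bit text (cp1252 is assumed here) or as UTF-16 little-endian. Documents that store their text in fragments would defeat it, and the match is case-sensitive. The function names are hypothetical.

```python
def doc_bytes_have_phrase(data, phrase):
    """Heuristic: look for phrase in raw .doc bytes in two assumed encodings."""
    for encoding in ("cp1252", "utf-16-le"):
        try:
            needle = phrase.encode(encoding)
        except UnicodeEncodeError:
            continue  # phrase cannot be represented in this encoding
        if needle in data:
            return True
    return False


def doc_has_phrase_raw(path, phrase):
    """Read the whole file as bytes and apply the heuristic above."""
    with open(path, "rb") as handle:
        return doc_bytes_have_phrase(handle.read(), phrase)
```

This is quick because it never launches Word, but treat a negative result as "probably not" rather than proof that the phrase is absent.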
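If you have access to antiword and grep, it should be as easy as:
for file in *.doc
do
    # antiword prints the document text; grep -q only reports whether the phrase was found
    if antiword "$file" | grep -qF "phrase to find"; then
        echo "$file"
    fi
done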
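If you want the matching file names back in a program rather than on the console, the same antiword approach could be wrapped roughly like this. It is only a sketch: it assumes the antiword executable is on the PATH, and antiword_text and files_containing are names invented for this example. The extractor is passed in as a parameter so a different text source can be swapped in.

```python
import subprocess


def antiword_text(path):
    """Run antiword on one file and return its plain-text output."""
    result = subprocess.run(["antiword", path], capture_output=True, text=True)
    return result.stdout


def files_containing(paths, phrase, extract_text=antiword_text):
    """Return the paths whose extracted text contains phrase (case-insensitive)."""
    wanted = phrase.lower()
    return [p for p in paths if wanted in extract_text(p).lower()]
```

A call such as files_containing(glob.glob("*.doc"), "phrase to find") would then give you the list of matching files. Note that antiword wraps its output into lines, so a phrase that straddles a line break may need whitespace normalisation before it matches.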
Is the Word Interop service an option for you? If Word is installed on the server, you could simply open the Word documents and use Word's own Find feature to locate your text. Of course, this is a resource-intensive method, and there are a lot of things to be careful of when using Interop services, but it would get the job done.
Your biggest issue may be that an appropriate version of Word needs to be installed on the server.
I don't have a code sample at hand, but have you looked at the Windows Indexing Service with the Office plugin installed?
You could do it with COM. However, if you are scanning a lot of files this might be painfully slow since you will be interacting with the text through Word itself.
Here is some Python code using pywin32's win32com module (sorry, I don't know much .NET, but the COM functions will be similar).
I'm guessing you might have to trim up the whitespace a bit to get good matches.
import os
import win32com.client

def doc_has_phrase(filename, phrase):
    """Return True if the Word document contains phrase (case-insensitive)."""
    app = win32com.client.Dispatch('Word.Application')
    try:
        # Arguments: FileName, ConfirmConversions, ReadOnly, AddToRecentFiles
        doc = app.Documents.Open(filename, False, True, False)
        return phrase.lower() in doc.Content.Text.lower()
    finally:
        # Always shut Word down, even if the document fails to open
        app.Quit()

phrase = 'key phrase in lowercase'
valid_types = ('.doc',)
path = "C:\\Path\\To\\Files\\"
for name in os.listdir(path):
    if name.lower().endswith(valid_types):
        full_path = os.path.join(path, name)
        print(doc_has_phrase(full_path, phrase), full_path)
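On the whitespace point: text pulled out of Word tends to contain carriage returns, tabs, and runs of spaces, so a phrase that spans a line or paragraph break will not match a plain substring test. One possible way to handle that, sketched here with hypothetical helper names, is to collapse all whitespace in both the document text and the phrase before comparing.

```python
def normalize(text):
    """Lowercase text and collapse every run of whitespace to one space."""
    return " ".join(text.split()).lower()


def text_has_phrase(text, phrase):
    """Compare the whitespace-normalized phrase against the normalized text."""
    return normalize(phrase) in normalize(text)
```

With this in place, the comparison in doc_has_phrase could become text_has_phrase(doc.Content.Text, phrase), assuming that looser matching is what you want.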
A VB.NET version of allclaws' method:
Function ContainsText(ByVal fileName As String, ByVal text As String) As Boolean
    Dim app As New Microsoft.Office.Interop.Word.Application
    Try
        ' Open inside the Try block so Word is shut down even if the file fails to open
        Dim doc As Microsoft.Office.Interop.Word.Document = app.Documents.Open(DirectCast(fileName, Object))
        Return doc.Content.Text.IndexOf(text, StringComparison.CurrentCultureIgnoreCase) <> -1
    Finally
        app.Quit()
    End Try
End Function
Edit:
To make this work, you need to add a reference to the Microsoft Word 11.0 Object Library, available on the COM tab of the Add Reference dialog.