I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.

I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.

Which is the best way to do this:

  1. VBA macro from inside Word to create CSV and then upload to the DB?
  2. VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
  3. Python script via win32com then upload to DB?

The last one is attractive to me as the web interface is being built with Django, but I've never used win32com or tried scripting Word from Python.

EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:

sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum

num_rows = Application.ActiveDocument.Tables(2).Rows.Count

For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n

Close #fnum

What's up with the little control character box? Is some kind of character code coming across from Word?

A: 

I'd say look at the related questions on the right --> The top one seems to have some good ideas for going the python route.

ranomore
The question "extracting text from MS word files in python" is about working in a Linux environment. Tools like antiword aren't available under Windows except in Cygwin, whereas this poster is willing to do COM scripting of Word.
John Fouhy
If you don't have anything nice to say... Some of the higher-voted answers to that question aren't Linux-specific at all. I guess you missed those.
ranomore
+1  A: 

Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:

from win32com.client import Dispatch

word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')  # Open lives on the Documents collection
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=2)  # 2 is wdFormatText (plain text)
doc.Close()
word.Quit()

This is untested, but I think something like that will just open the file and save it as plain text; you could then read the text into Python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it offhand. Documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.

Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.

You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.
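
To grab table contents directly rather than going through a saved text file, a minimal sketch along these lines might work. It is untested against a live Word instance. `extract_table_rows` is a hypothetical helper name, and the sketch assumes the table has no merged cells (where `Cell(row, col)` can fail) and that the end-of-cell marker arrives as a carriage return and/or a BEL character. The `doc` argument is whatever `word.Documents.Open(...)` returns.

```python
def extract_table_rows(doc, table_index, columns):
    """Return one tuple of strings per row of the chosen table.

    doc         -- a Word Document COM object (or anything shaped like one)
    table_index -- 1-based index into doc.Tables, as in VBA
    columns     -- 1-based column numbers to pull from each row
    """
    table = doc.Tables(table_index)
    rows = []
    for r in range(1, table.Rows.Count + 1):
        cells = []
        for c in columns:
            raw = table.Cell(r, c).Range.Text
            # Assumption: the trailing marker is CR and/or BEL; strip both.
            cells.append(raw.rstrip('\r\x07').strip())
        rows.append(tuple(cells))
    return rows
```

With Word running, `extract_table_rows(doc, 2, (2, 3, 4))` would match the table and columns used in the question.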

John Fouhy
A: 

How about saving the file as XML, then using Python or something else to pull the data out of the XML and into the database?
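
As a rough sketch of the XML route: this assumes the document is saved as .docx, whose `word/document.xml` part uses the WordprocessingML namespace shown below, rather than the older Word 2003 XML format. `table_rows_from_docx_xml` is a made-up helper name, and nested tables are not handled specially.

```python
import xml.etree.ElementTree as ET

# Namespace of the main document part inside a .docx file.
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def table_rows_from_docx_xml(xml_text, table_index=0):
    """Return the text of each cell, row by row, for one table (0-based index)."""
    root = ET.fromstring(xml_text)
    table = list(root.iter(W + 'tbl'))[table_index]
    rows = []
    for tr in table.findall(W + 'tr'):
        cells = []
        for tc in tr.findall(W + 'tc'):
            # A cell's text is spread over one or more w:t runs;
            # paragraphs inside a cell are joined without a separator.
            cells.append(''.join(t.text or '' for t in tc.iter(W + 't')))
        rows.append(cells)
    return rows
```

The XML itself could be read with `zipfile.ZipFile(path).read('word/document.xml')`, since a .docx file is a zip archive.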

cbrulak
+4  A: 

Word has a little marker thingy that it puts at the end of every cell of text in a table.

It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.

Just use the Left() function to strip it out, i.e.

 Left(Target, Len(Target) - 1)
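
If the text ends up in Python instead (for example via win32com), the same clean-up can be sketched as below. Stripping with `rstrip` avoids having to know exactly how many marker characters come across; that the marker arrives as a carriage return and/or a BEL character (`\x07`) is an assumption here. Writing through the `csv` module also quotes any commas inside a description. `strip_cell_marker` and `action_items_to_csv` are hypothetical helper names.

```python
import csv

# Assumption: characters Word appends to the text of every table cell.
END_OF_CELL = '\r\x07'


def strip_cell_marker(text):
    """Remove the trailing end-of-cell marker from raw cell text."""
    return text.rstrip(END_OF_CELL)


def action_items_to_csv(rows, out):
    """Write (description, assignee, target) rows to the file-like object `out`.

    Rows whose target is empty after stripping are skipped, mirroring the
    intent of the `If Target = ""` check in the question.
    """
    writer = csv.writer(out)
    for raw in rows:
        descr, assign, target = (strip_cell_marker(s) for s in raw)
        if target:
            writer.writerow([descr, assign, target])
```

Note that the emptiness check only works after stripping, because the raw cell text is never an empty string.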

By the way, instead of

 num_rows = Application.ActiveDocument.Tables(2).Rows.Count
 For n = 1 To num_rows
      Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text

Try this:

 For Each row In Application.ActiveDocument.Tables(2).Rows
      Descr = row.Cells(2).Range.Text
 Next row
Joel Spolsky
Thanks Joel! I had figured out that I could use Left() to strip off the end-of-cell marker, but that didn't seem elegant to me. Also, thanks for the other pointer. I'm no expert programmer and definitely not a VBA guru.
Technical Bard
+1  A: 

You could use OpenOffice. It can open Word files, and it can also run Python macros.

nosklo
A: 

It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.
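
For comparison, the saved HTML can also be read outside Access. A minimal sketch, assuming the document was saved as HTML and using only Python's standard `html.parser`; the names `TableExtractor` and `tables_from_html` are mine, and nested tables are not handled:

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell as tables -> rows -> cells."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self.tables[-1].append([])
        elif tag in ('td', 'th') and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Collapse runs of whitespace (including non-breaking spaces).
            text = ' '.join(''.join(self._cell).split())
            self.tables[-1][-1].append(text)
            self._cell = None


def tables_from_html(html_text):
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables
```

Each table comes back as a list of rows, so the rows of the second table in a document would be `tables_from_html(text)[1]`.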

Remou