tags:

views:

904

answers:

6

I'm looking into a requirements management system (like requiste pro - Rational Rose) - and will need to read through a MS Word doc searching for specific tags - on either a windows or Apple OS environment. Are there any known frameworks for this (I couldn't find any) - or suggested approaches?

Just to add some clarification - this would not be a one-time read, I'd review the doc every time there is an update to it and perform a CRUD on the requirement specific areas.

+3  A: 

First, get it out of native Word (.doc) format.

  • Do a "Save As XML" and insist your users work with that file instead of the .doc file. They'll hardly notice the difference -- except that the file is bigger.

    Use lxml or element tree to parse the XML and find the headings, sections, paragraphs and lists.

  • You can also do a "Save As HTML" before doing your analysis. This works just as well as the XML version. The HTML version isn't as easy for users, however, so do this prior to your analysis only.

    Use Beautiful Soup to parse the HTML and find the headings, sections, paragraphs and lists.

Once you have a parse structure (XML or HTML) you can analyze the document looking for specific tags.

S.Lott
A: 

If you have a bit of cash you can buy the Aspose.Words Java API. With it you can programatically access and manipulate any Word document

Conrad
If you are prepared to use Java, docx4j is also an option.
plutext
A: 

I know this is a Python question but...

On Windows you should use VBScript (VBA Macros) and OLE to programmically access Word.

Examples | How-tos | Automating Word using OLE

On MacOSX you use VBA for older versions and AppleScript for Office 2008.

Article

With VBA you have a choice of either modifying the document in-place or performing an automated "Save As" to get the data in a more easily handled format (though be warned its HTML export is abysmal).

I strongly recommend staying away from third-party libraries/products for this, even if you dislike vbscript. The format is far too complex, undocumented and inconsistent for accurate external handling. StarOffice/OpenOffice is proof of that. They've been trying for years and still haven't got accurate .doc parsing, let alone .docx. Yes it works in general but you run an unquantifiable risk of mangling documents once you start trying to programmically modify them outside of Word. You should be able to call VBscript from Python using os.system. I think the interpreter is wscript.exe but don't hold me to that. This may work though:

os.system('start script.vb')
SpliFF
Is DOCX not supposedly easier as it is closer to *actual* XML? Pre-2007 "MS" XML didn't count. It's also an "open standard"! ;)
Lucas Jones
It's an open standard encompassed in a 6000 page document and allowing the inclusion of "binary blobs" for backwards compatibility. Reliable and accurate third-party support will be a while coming I think.
SpliFF
+2  A: 

You can build on the ability of openoffice.org to read Word documents. The Python-UNO bridge allows to use the standard OpenOffice.org API from the python scripting language. Using Python-UNO and having relevant parts of openoffice on your machine, it should be straightforward to read most Word documents.

gimel
+2  A: 

Using Visual Studio Tools for Office (VSTO), it is possible to script Word from any .NET language. The How to: Search for Text in Documents example shows C# and Visual Basic code, but IronPython can also call the same .NET methods.

If you are ready to use IronPython (no Mac equivalent), this could be a Windows specific solution for searching inside Word documents.

gimel
+2  A: 

Assuming you are on windows and have Word installed you can control Word from within python using COM - see Python for win32 On Linux you can do the same with OpenOffice.

Alternatively there are a bunch of string extractors for Word for both win32 or Linux, you can then use the normal python regex tools.

See this question http://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python

Martin Beckett