The release notes for a piece of software contain some important data that I would like to extract for every release. Is there a way to extract specific information from a Microsoft Word document?

The application I have in mind would be written in C#, but I am open to any other solution.

A: 

I did a lot of Excel programming with VSTO (Visual Studio Tools for Office). I think you will be able to use the VSTO API to read a Word document, and you should be able to use C#.

iterationx
A: 

You could write an IFilter to extract text from Word files. No need to have Word installed.

Darin Dimitrov
A: 

All MS Office products (Word, Excel, etc.) are totally scriptable, both internally (using VBA) and externally (via OLE Automation, also known as ActiveX; in fact, VBA uses the interface exposed through OLE).

My suggestion would be to look for a library in your language that supports this. The Perl module Win32::OLE is one that does: it's quite easy to use and very powerful. The interface should be similar for other languages.

j_random_hacker
+1  A: 

I went through this a few years back. You can:

  1. Use Word to convert the file into some other format, such as ASCII, RTF, or XML.

  2. Use some third-party app to convert to another format, such as ASCII.

  3. Access the Word API through OLE and extract the information directly.

I couldn't find any generic libraries to read Word files, and back then all of the applications that read Word files only worked for a subset. Word changed often enough that they had trouble keeping up.

There were some documents that listed the specifics of the older Word file formats, but the underlying file structure is outrageously complicated. Without a lot of resources it would be hard to keep code in sync with the file format.

Initially, I used Perl to drive Word and create new documents, but the solution was too fragile. Later I switched the whole application to work with PDFs instead, and gave up on Word.

Paul.

Paul W Homer
A: 

You can work from within Word (VBA, VSTO) or outside it.

From outside it, automation is one approach.

Another is to avoid using Word entirely. If the docs are .docx, you can use anything which can manipulate an Open XML file. Microsoft has its Open XML SDK, and in the Java world you can use docx4j or POI.
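To illustrate the "avoid Word entirely" route: a .docx file is a ZIP archive of XML parts, so even a standard library can pull the text out. The following is a minimal Python sketch, not a full Open XML reader. It assumes the main document part lives at the conventional path `word/document.xml` (strictly, the package relationships decide this), and the function name `docx_paragraphs` is made up for this example. It ignores headers, footers, footnotes, and tracked changes, and a paragraph nested inside another one (for example in a text box) would show up twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file, in document order.

    `source` is a file path or a binary file-like object.
    Assumption: the main part is stored at word/document.xml.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; its visible text is split across <w:t> runs.
    for para in root.iter(f"{{{W_NS}}}p"):
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

With the paragraphs as plain strings, picking out release-note fields becomes ordinary string processing, for instance `[p for p in docx_paragraphs("release_notes.docx") if p.startswith("Version")]`, where both the file name and the "Version" prefix are placeholders for whatever your notes actually contain.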

plutext