Is there a programmatic way to extract equations (and possibly images) from an MS Word document? I've googled all over, but have yet to find anything that I can sink my teeth into and work from. If possible, I'd like to be able to do this with VB.NET or C#, but I can pick up enough of any language to hack out a DLL. Thanks!

EDIT: Right now I'm looking at extracting the equations from Word 2003, but if converting it to 2007/Open XML is required, that's fine.

A: 

Try looking at the Word-to-LaTeX converter. It requires the .NET Framework, and although the source hasn't been released yet, the author invites questions about it.

Rob

RobS
+1  A: 

What Word format are your documents in? If they are in Open XML (file extension .docx) you could use the Open XML SDK available from Microsoft to extract images and embedded content.

An Open XML file is nothing but a zip archive with a defined internal structure. The SDK includes examples showing how to access the parts of that archive. In fact, you could use any zip-capable library to extract the content from the document package.
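As a rough sketch of the zip approach, here it is in Python for brevity; the same steps map onto any zip library in .NET. It assumes the main document part lives at word/document.xml and that images and embedded objects sit under word/media/ and word/embeddings/, which is the conventional layout but not guaranteed. The function names are made up for illustration. Native Word 2007 equations are stored as m:oMath elements, whereas equations converted from Word 2003 will most likely show up as embedded OLE objects with a preview image instead.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace of Office Math Markup Language (OMML), used by Word 2007+ equations.
OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ and word/embeddings/ into out_dir.

    Assumption: the package uses the conventional folder layout.
    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as package:
        for name in package.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            if name.startswith(("word/media/", "word/embeddings/")):
                target = out_dir / Path(name).name
                target.write_bytes(package.read(name))
                written.append(target)
    return written


def find_equations(docx_path):
    """Return each <m:oMath> element of the main document part as an XML string.

    Assumption: the main part is stored at word/document.xml.
    """
    with zipfile.ZipFile(docx_path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return [
        ET.tostring(element, encoding="unicode")
        for element in root.iter(f"{{{OMML_NS}}}oMath")
    ]
```

Calling extract_media("report.docx", "extracted") and find_equations("report.docx") on a converted file (the file name is only an example) shows quickly which of the two storage forms your equations ended up in.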

If the documents still use the older binary format, things are a bit more complicated. I think the easiest way would be to convert the documents to the Open XML format first. There are several ways to do this:

  • Get the free and open-source b2xtranslator from SourceForge, which offers C# DLLs for file conversion.
  • Install Microsoft's Compatibility Pack and use the following command line for conversion:

    "C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme input_file output_file

where input_file and output_file must be full path names.
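If there are many documents, the command line above can be scripted. The following is only a sketch in Python: the executable path and the -oice -nme flags are copied verbatim from the command above, while the function names and the folder-based workflow are assumptions for illustration. Both paths are expanded to full path names before the converter is called.

```python
import os
import subprocess

# Path and flags copied from the command line above; adjust the path if the
# Compatibility Pack is installed elsewhere.
WORDCONV = r"C:\Program Files\Microsoft Office\Office12\wordconv.exe"


def build_wordconv_args(input_file, output_file):
    """Build the argument list, expanding both paths to full path names."""
    return [
        WORDCONV,
        "-oice",
        "-nme",
        os.path.abspath(input_file),
        os.path.abspath(output_file),
    ]


def convert_folder(src_dir, dst_dir):
    """Convert every .doc file in src_dir to a .docx of the same name in dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        if name.lower().endswith(".doc"):
            output = os.path.join(dst_dir, os.path.splitext(name)[0] + ".docx")
            args = build_wordconv_args(os.path.join(src_dir, name), output)
            # check=True raises if the converter exits with a non-zero status.
            subprocess.run(args, check=True)
```

Passing the arguments as a list avoids quoting problems with the space in "Program Files".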

0xA3
+1  A: 

I don't know if any of this will help, but the object model in Word 2000/2003 has an InlineShapes collection as part of the Document object which represents embedded images and possibly similar objects like equations.

Some VBA code to copy the first item onto the clipboard, which might help you extract them:

' Select the first inline shape (the collection is 1-based), then copy it to the clipboard
ThisDocument.InlineShapes(1).Select
Selection.Copy

It's accessible from .NET too; see the MSDN link.

xahtep
This is actually what I went with. Thank you!
AndyB