I need to break open an MS Word file (.doc) and extract its constituent files ('[1]CompObj', 'WordDocument', etc.). Something like 7-zip can be used to do this manually, but I need to do it programmatically.

I've gathered that a Word document is an OLE container (which is why 7-zip can view its contents), but I can't work out how to do the following in C++:

  1. open the OLE container
  2. extract each constituent file and save it to disk

I've found a couple of examples of OLE automation (e.g. here), but what I want to do seems to be less common and I've found no specific examples.

If anyone knows of an API (?!) or a tutorial for working with OLE, I'd be grateful. Ditto any code samples. Thanks in advance.

+2  A: 

This is called Compound Files, part of the Structured Storage API. You start with StgOpenStorageEx(). It buys you little for a Word .doc file, though, because the streams themselves have a sophisticated binary format. To really read the document content you want to use automation, letting Word read the file. That's rarely done in C++, but that project shows you how.
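For the "dump every stream to disk" part of the question, here is a rough sketch of how the Structured Storage calls might fit together. It is untested against real Word files. The function names (SafeName, SaveStream, DumpStorage, ExtractAll) are made up for illustration; only the Win32/COM calls are real API. It assumes that the "[1]" 7-zip shows in '[1]CompObj' stands for a leading control character (0x01) in the stream name, which cannot be used in a Windows file name, so the sketch rewrites such characters in the same "[n]" style.

```cpp
// Sketch: copy every stream of an OLE compound file (e.g. a .doc) to disk.
#include <string>

// Hypothetical helper: rewrite control characters in a stream name as "[n]".
std::wstring SafeName(const std::wstring& name) {
    std::wstring out;
    for (wchar_t c : name) {
        if (static_cast<unsigned long>(c) < 0x20) {
            out += L'[';
            out += std::to_wstring(static_cast<unsigned long>(c));
            out += L']';
        } else {
            out += c;
        }
    }
    return out;
}

#ifdef _WIN32
#include <windows.h>
#include <objbase.h>   // StgOpenStorageEx, IStorage, IStream (link with ole32.lib)

// Copy one named stream of `parent` into the file `path`.
static HRESULT SaveStream(IStorage* parent, const wchar_t* name, const std::wstring& path) {
    IStream* stm = nullptr;
    // Child elements have to be opened with STGM_SHARE_EXCLUSIVE.
    HRESULT hr = parent->OpenStream(name, nullptr, STGM_READ | STGM_SHARE_EXCLUSIVE, 0, &stm);
    if (FAILED(hr)) return hr;
    HANDLE h = CreateFileW(path.c_str(), GENERIC_WRITE, 0, nullptr,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) {
        stm->Release();
        return HRESULT_FROM_WIN32(GetLastError());
    }
    char buf[4096];
    ULONG got = 0;
    DWORD written = 0;
    // Read until the stream reports no more data.
    while (SUCCEEDED(hr = stm->Read(buf, sizeof buf, &got)) && got > 0) {
        if (!WriteFile(h, buf, got, &written, nullptr)) {
            hr = HRESULT_FROM_WIN32(GetLastError());
            break;
        }
    }
    CloseHandle(h);
    stm->Release();
    return hr;
}

// Walk a storage: streams become files, nested storages become directories.
static HRESULT DumpStorage(IStorage* stg, const std::wstring& outDir) {
    CreateDirectoryW(outDir.c_str(), nullptr);  // failure if it already exists is ignored
    IEnumSTATSTG* en = nullptr;
    HRESULT hr = stg->EnumElements(0, nullptr, 0, &en);
    if (FAILED(hr)) return hr;
    STATSTG st;
    while (en->Next(1, &st, nullptr) == S_OK) {
        const std::wstring target = outDir + L"\\" + SafeName(st.pwcsName);
        if (st.type == STGTY_STREAM) {
            hr = SaveStream(stg, st.pwcsName, target);
        } else if (st.type == STGTY_STORAGE) {
            IStorage* child = nullptr;
            hr = stg->OpenStorage(st.pwcsName, nullptr,
                                  STGM_READ | STGM_SHARE_EXCLUSIVE, nullptr, 0, &child);
            if (SUCCEEDED(hr)) {
                hr = DumpStorage(child, target);
                child->Release();
            }
        }
        CoTaskMemFree(st.pwcsName);  // the enumerator allocates the name
        if (FAILED(hr)) break;
    }
    en->Release();
    return hr;
}

// Open the compound file read-only and dump its contents under outDir.
HRESULT ExtractAll(const wchar_t* docPath, const std::wstring& outDir) {
    IStorage* root = nullptr;
    HRESULT hr = StgOpenStorageEx(docPath, STGM_READ | STGM_SHARE_DENY_WRITE,
                                  STGFMT_STORAGE, 0, nullptr, nullptr,
                                  IID_PPV_ARGS(&root));
    if (FAILED(hr)) return hr;
    hr = DumpStorage(root, outDir);
    root->Release();
    return hr;
}
#endif  // _WIN32
```

A call might look like ExtractAll(L"C:\\temp\\report.doc", L"C:\\temp\\report_streams") (both paths are placeholders), checking the returned HRESULT with FAILED(). The files you get this way are the raw streams, so the caveat above still applies: 'WordDocument' is in Word's binary format, not readable text.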

Hans Passant
A: 

This site, http://www.endurasoft.com/vcd/ststo.htm, contains a tutorial, API information, and a code sample that together do everything I was looking for.

Ben L