I have a large collection of MS Word documents (approximately 40,000), which are the results of mail merges (same main document, different data sources).

One of the merge fields is a text field which could have the text "Yes" or "No".

Is there an easy way to list which of the documents have that merge field set to the value "Yes"? (I'm expecting approximately 10,000 "Yes" documents.)

I'd be interested in any approach, whether using Word itself, Office Automation, hexdumping the binary files and grepping for certain magic, or any ready-made tools (perl scripts, .NET apps, etc) which can do this sort of thing.

The files are on a network share accessible from both Linux and Windows boxes (and I can probably steal a Mac for a little while if necessary), so I'm not too worried about which platform the tools run on...

+1  A: 

If they were Word 2007 documents it'd be much easier, as the default file format (.docx) is a ZIP archive of XML files. (Even with Word 2003 you can save as an XML document, though it's not the default.) I assume, however, that these are standard Word 2003 documents using the default binary file format.
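To illustrate why the XML-based format would be easier, here is a minimal Python sketch that pulls the paragraph text out of a .docx file using only the standard library. It assumes the files had been saved in Word 2007 format, which is probably not the case here; the function name `docx_paragraphs` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in the main document part."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is split across one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return paragraphs
```

Each returned string could then be checked for the text that surrounds the merged value. Headers, footers and other parts of the package are not read by this sketch.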

I believe that there are tools out there which can process the binary file format directly, and which might be able to convert the docs into text files which you could then process - presumably you could search for some text appearing just before the field, e.g. "Are you serious:".
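Once the documents have been converted to text by whatever tool you pick, the search itself is short. The following is a rough Python sketch, assuming the converted files are UTF-8 `.txt` files in a single directory and that the merged value directly follows a fixed label. The label "Are you serious:" is only the placeholder from above, and the names `field_value` and `yes_files` are invented for this example.

```python
import re
from pathlib import Path

# Hypothetical label: the fixed text that precedes the merge field in the main document.
LABEL = "Are you serious:"
PATTERN = re.compile(re.escape(LABEL) + r"\s*(Yes|No)\b", re.IGNORECASE)


def field_value(text):
    """Return 'Yes' or 'No' if the label is found in the text, else None."""
    match = PATTERN.search(text)
    return match.group(1).capitalize() if match else None


def yes_files(text_dir):
    """Yield the converted .txt files whose field value is 'Yes', in sorted order."""
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if field_value(text) == "Yes":
            yield path
```

Files for which `field_value` returns `None` are worth logging separately, since that usually means the conversion mangled the label rather than that the answer was "No".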

However, the easiest/simplest way (but slowest, in terms of execution time) would be to write a VBA program to open each doc, search for the field, and extract the result. It'd be pretty straightforward VBA, and you could do it in Word itself (which would mean that the code could use the existing running instance of Word). I'd say you could get that up and running in a couple of hours - then you could put your feet up for a few more hours while it did its work :-)
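The same open-search-close loop can also be driven from outside Word through Office Automation. Below is a sketch in Python rather than VBA, assuming a Windows machine with Word and the pywin32 package installed. It relies on the same Word object model calls a VBA macro would use (`Documents.Open`, `Content.Text`, `Close`, `Quit`). The label text and the function names are placeholders, not anything taken from your documents.

```python
import re

LABEL = "Are you serious:"  # hypothetical label text preceding the merged value


def is_yes(text, label=LABEL):
    """True if the label is followed by 'Yes' in the document text."""
    return re.search(re.escape(label) + r"\s*Yes\b", text) is not None


def scan_with_word(paths):
    """Open each document in a single Word instance; return paths whose field is 'Yes'."""
    import win32com.client  # pywin32; imported here so is_yes() works without it

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    hits = []
    try:
        for path in paths:
            # Positional arguments: FileName, ConfirmConversions=False, ReadOnly=True.
            # Word expects absolute paths here.
            doc = word.Documents.Open(str(path), False, True)
            try:
                if is_yes(doc.Content.Text):
                    hits.append(path)
            finally:
                doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
    return hits
```

Reusing one Word instance for the whole run avoids paying the start-up cost 40,000 times. For a run that long it is probably worth writing the hits to a file as you go, so that a crash does not lose the results.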

Gary McGill
Yep, that's my assessment of the situation at the moment. I was considering using wvWare as the binary processor option, or the VBA route. I was hoping someone had either a) suggestions of pre-existing tools/VBA scripts, or b) some docs on where in the binary format to look.
Stobor