If I'm given a .doc file with special tags in it such as [first_name], how do I go about replacing all occurrences of it with something like "Clark"? A simple binary replacement only works if the replacement string is the exact same length.

Haskell, C, and C++ answers would be best, but any compiled language would do. I'd also prefer to do this without an external library, since it has to be deployed on both Windows and Linux, and cross-platform dependency handling is a bitch.

To summarize...

.doc -> magic program -> .doc with strings replaced
+1  A: 

You probably have to use .NET programming (VB or C#) to create a Word.Application object and then use the MS Word object model to manipulate your document.

Heath Hunnicutt
I'm not necessarily on Windows.
Clark Gaebel
Well, given the "openness" of M$, there is no other guaranteed way to parse a DOC file correctly. The DOCX suggestion is great, provided those files were written by Word itself. Word is great at converting its own formats.
Heath Hunnicutt
@Clark - In other words, it is necessary for you to be on Windows. Whether to generate the docs or use the COM object, you can't reliably parse DOC anywhere else. I don't think that's a good thing, I'm just reporting my view of reality.
Heath Hunnicutt
OpenOffice seems to do it fine.
Clark Gaebel
@Clark. For now. Good luck! Even Jake mentioned the issues with OpenOffice, but if it seems to work for you, then best wishes.
Heath Hunnicutt
+4  A: 

You could use the Word COM component ("Word.Application") on Windows to open the file, do the replacements, save the file, and close it. However, this is Windows-only and can be buggy.

Another thing you could do is use the OpenOffice.org command line interface to convert the file to the ODF format, unzip the file (ODF is mostly zipped XML), do the replacements in the files inside, re-zip the file, and convert it back to .doc format. However, OpenOffice.org doesn't always read Word files correctly (especially if there is a lot of complex formatting), and it makes your program harder to distribute (users must either have OpenOffice.org installed or you must ship it with your program).

Also, if you have a file in the .docx format, you can unzip it, do the replacements, and re-zip it.
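
For what it's worth, the unzip / replace / re-zip idea for .docx can be sketched in a few lines. This is only a rough illustration in Python (standard library only), not a tested tool: the function name `replace_in_docx` is made up for this example, and it assumes the main text lives in `word/document.xml`, that this part is UTF-8, and that each tag sits inside a single text run. Word sometimes splits a tag like [first_name] across several runs, in which case a plain string replace will not find it.

```python
import zipfile
from xml.sax.saxutils import escape


def replace_in_docx(src_path, dst_path, replacements, part="word/document.xml"):
    """Copy src_path to dst_path, replacing tags inside one XML part.

    Hypothetical helper for illustration. Assumes `part` is UTF-8 XML and
    that every tag appears contiguously in the raw XML text.
    """
    with zipfile.ZipFile(src_path) as zin, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == part:
                text = data.decode("utf-8")
                for tag, value in replacements.items():
                    # Escape &, < and > so the result is still valid XML.
                    text = text.replace(tag, escape(value))
                data = text.encode("utf-8")
            # Every other entry is copied through unchanged.
            zout.writestr(item.filename, data)
```

Calling `replace_in_docx("in.docx", "out.docx", {"[first_name]": "Clark"})` would write a new file and leave the original untouched. The same steps should carry over to a compiled language, but there you would need some zip implementation to do the archive handling.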

jake33
I'm not necessarily on Windows, although using docx looks promising. One upboat for you, good sir!
Clark Gaebel
.DOC converts to and from .RTF fairly gracefully in most versions of Word that use .DOC. RTF is effectively an assembly language for .DOC files, and with care it would be possible to do search and replace operations in it. I don't know offhand of an easy-to-automate way to do the conversions, but one probably exists.
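
To give an idea of what "with care" means, here is a minimal sketch of the replace step on an RTF file, in Python with only the standard library. The names `rtf_escape` and `replace_in_rtf` are invented for this example. It assumes the tags are plain ASCII and appear contiguously in the file (Word may insert control words in the middle of a tag), and that the document uses the default of one fallback character after each `\u` escape. Line breaks in the replacement text are not handled.

```python
import struct


def rtf_escape(text):
    """Escape text for insertion into RTF body text (illustrative helper)."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: emit \uN with N as a signed 16-bit value per
            # UTF-16 code unit, followed by "?" as the fallback character.
            for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
                out.append("\\u%d?" % unit)
    return "".join(out)


def replace_in_rtf(src_path, dst_path, replacements):
    """Replace ASCII tags in an RTF file, working on raw bytes."""
    with open(src_path, "rb") as f:
        data = f.read()
    for tag, value in replacements.items():
        data = data.replace(tag.encode("ascii"),
                            rtf_escape(value).encode("ascii"))
    with open(dst_path, "wb") as f:
        f.write(data)
```

Because the replacement is escaped and spliced into plain text, the new string can be any length, which is exactly what a binary replace in the .doc itself cannot do.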
RBerteig
RTF seems perfect. Thanks for another great suggestion! I think I'll just accept this answer.
Clark Gaebel
+2  A: 

First read the Word Document Specification.

If that hasn't terrified you, then you should find it fairly straightforward to figure out how to read and write it. It must be possible; Word manages to do it most of the time.

Mike Seymour
It terrified me. I already looked at it (600 pages), and ran to SO screaming.
Clark Gaebel
+1. I really enjoy this kind of dry humor when it shows up in answers to this sort of question... especially when it's also factually accurate!
RBerteig
A: 

Why do you want to be using C/C++/Haskell or another compiled language? I'm not too familiar with Haskell, but in general I would say that C is not a great language for performing text processing. A lot of interpreted languages (Perl, Python, etc.) also have powerful regular expression libraries that are suited for finding and replacing phrases.

With that said, as the other posters have noted, you will still have to deal with the eccentricities of the .doc format.

DuneBug