Hello there,

I have some Word template files (.dot/.dotx) that contain XML tags along with plain text.
At run time, I need to replace the XML tags with their respective mail merge fields.

So I need to parse the document for these XML tags and replace them with merge fields. I was using Regex to find and replace the tags, but it was suggested that I use an XML parser instead (http://stackoverflow.com/questions/1902095/regex-for-string-enclosed-in-c).

Now that I have presented my case better,
could you please tell me whether an XML parser is the right tool to achieve the above?
If yes, do I need to save the Word document as an XML file and then parse it for the XML tags?

Please guide.

+1  A: 

Why don't you use the Word APIs to do this? I can't imagine any way to do this safely without using the APIs that were designed for the purpose.

John Saunders
Well, I am using Aspose to insert the merge fields and don't want to be dependent on any particular version of Word being installed on the server.
inutan
You didn't say you were running on a server. Please update your question with details of your environment.
John Saunders
A: 

Yes, you can use the System.Xml.XmlDocument class to read your XML source. You'll also need to declare all the namespaces required to deal with that XML content.

Rubens Farias
+1  A: 

You need to use the Word APIs. This is more complicated than you think.

Word 2003 files (.doc, .dot) are stored in a proprietary binary format. Handling this format yourself by working from the specification is nearly impossible, so it's well worth investing in an SDK for this, or connecting directly to Word through COM to handle the processing.

Word 2007 files (.docx, .dotx) are indeed in XML, but a .docx file is actually a zipped hierarchy of folders and files that build the document in pieces. For this, the OpenXML SDK can handle .docx, and I assume it can also handle the equivalent templates.
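
Since the two formats need different handling, it can help to check which container a template actually uses before choosing a tool. The following is only a minimal sketch: the names WordContainer and WordFormatSniffer are made up for illustration, and it relies solely on the well-known file signatures (a zip local file header starts with "PK\x03\x04", an OLE compound file with D0 CF 11 E0 A1 B1 1A E1). It tells you the container type, not whether the content is a valid Word document.

```csharp
using System;
using System.IO;

// Hypothetical names, for illustration only.
public enum WordContainer { Unknown, BinaryCompoundFile, ZipPackage }

public static class WordFormatSniffer
{
    // Signature of an OLE compound file (.doc/.dot).
    private static readonly byte[] OleSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    public static WordContainer Detect(Stream input)
    {
        var head = new byte[8];
        int read = 0;
        // Stream.Read may return fewer bytes than requested, so loop.
        while (read < head.Length)
        {
            int n = input.Read(head, read, head.Length - read);
            if (n == 0) break;
            read += n;
        }

        // Zip local file header: 'P' 'K' 0x03 0x04 (.docx/.dotx).
        if (read >= 4 && head[0] == 0x50 && head[1] == 0x4B &&
            head[2] == 0x03 && head[3] == 0x04)
            return WordContainer.ZipPackage;

        if (read == OleSignature.Length)
        {
            bool match = true;
            for (int i = 0; i < OleSignature.Length; i++)
                if (head[i] != OleSignature[i]) { match = false; break; }
            if (match) return WordContainer.BinaryCompoundFile;
        }

        return WordContainer.Unknown;
    }
}
```

Only a ZipPackage result means there is XML inside worth parsing; a BinaryCompoundFile result points back to an SDK or to Word itself.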

An alternative for the 2007 format is to create your template using Word, then learn the hierarchy of files and handle them appropriately. Change the .docx or .dotx extension to .zip, unzip it, and find where your find-and-replace tags are located. You may be able to just replace the tags, rezip the hierarchy and rename the extension.
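
The unzip, edit and rezip steps can also be done in code without renaming anything. What follows is a rough sketch rather than a tested solution: it assumes .NET 4.5 or later (System.IO.Compression.ZipArchive), PackagePartEditor and RewritePart are made-up names, and the caller supplies the text transformation. In a Word package the main body normally lives in the part word/document.xml. Be aware that Word may split what looks like one tag across several runs, so a plain string replace can miss it.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, for illustration only.
public static class PackagePartEditor
{
    // Rewrites one part of a zip-based package in place.
    // The stream must be readable, writable and seekable.
    public static void RewritePart(Stream package, string partName,
                                   Func<string, string> transform)
    {
        using (var zip = new ZipArchive(package, ZipArchiveMode.Update, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry(partName);
            if (entry == null)
                throw new FileNotFoundException("Part not found in package", partName);

            // Read the current content of the part.
            string xml;
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
                xml = reader.ReadToEnd();

            string updated = transform(xml);

            // Truncate the part and write the new content (UTF-8, no BOM).
            using (Stream target = entry.Open())
            {
                target.SetLength(0);
                using (var writer = new StreamWriter(target, new UTF8Encoding(false)))
                    writer.Write(updated);
            }
        }
    }
}
```

A call might look like PackagePartEditor.RewritePart(stream, "word/document.xml", xml => xml.Replace(oldTag, newText)), where stream is a FileStream opened with FileAccess.ReadWrite on a copy of the template.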

Will Eddins
+1 for OpenXML SDK link
Dennis Palmer
A: 

First of all, I think Regex should be just fine.

But if you really want to use an XML parser I love XmlDocument/XmlNode in .NET. The two functions SelectSingleNode and SelectNodes are infinitely useful. Unfortunately, I do not have a Word XML example in front of me, so let's assume this XML:

<Document>
  <MergeField name="phone"></MergeField>
  <MergeField name="email"></MergeField>
</Document>

You would then use code as follows:

// Requires: using System.Xml;
XmlDocument wordDoc = new XmlDocument();
wordDoc.Load(fileName);

XmlNodeList mergeNodes = wordDoc.SelectNodes("//MergeField");

foreach (XmlNode mergeNode in mergeNodes)
{
   string fieldName = mergeNode.Attributes["name"].Value;
   // Do something here based on field name
   // e.g. (GetFieldValue is your own lookup method):

   mergeNode.InnerText = GetFieldValue(fieldName);
}

wordDoc.Save(fileName);

The tricky part is that Word XML uses XML namespaces all over the place, so you need to use the XmlNamespaceManager class in .NET to tell the XML document which namespace is which. It would be more like:

// Requires: using System.Xml;
XmlDocument wordDoc = new XmlDocument();
wordDoc.Load(fileName);

// Map a prefix of your choice to the namespace URI used in the document
XmlNamespaceManager nsm = new XmlNamespaceManager(wordDoc.NameTable);
nsm.AddNamespace("o", "http://somenamepaceurl.com");
XmlNodeList mergeNodes = wordDoc.SelectNodes("//o:MergeField", nsm);

foreach (XmlNode mergeNode in mergeNodes)
{
   string fieldName = mergeNode.Attributes["name"].Value;
   // Do something here based on field name
   // e.g. (GetFieldValue is your own lookup method):

   mergeNode.InnerText = GetFieldValue(fieldName);
}

wordDoc.Save(fileName);
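
For a 2007-format file, the main namespace is http://schemas.openxmlformats.org/wordprocessingml/2006/main (conventionally prefixed w), and visible text sits in w:t elements. Below is a minimal sketch of the same idea against that namespace. WordXmlSketch and ReplaceRunText are made-up names, the input is the XML of the document part as a string, and it only swaps the text of a w:t element that matches a placeholder exactly. It does not create a real merge field (that would still be a job for Aspose or similar), and it will miss a placeholder that Word has split across several runs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class WordXmlSketch
{
    // Main WordprocessingML namespace used by .docx/.dotx parts.
    public const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string ReplaceRunText(string documentXml,
                                        IDictionary<string, string> replacements)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true; // keep Word's whitespace as-is
        doc.LoadXml(documentXml);

        var nsm = new XmlNamespaceManager(doc.NameTable);
        nsm.AddNamespace("w", W);

        // Copy the matches to a list so the document can be edited
        // safely while looping.
        List<XmlNode> textNodes =
            doc.SelectNodes("//w:t", nsm).Cast<XmlNode>().ToList();

        foreach (XmlNode textNode in textNodes)
        {
            string replacement;
            if (replacements.TryGetValue(textNode.InnerText, out replacement))
                textNode.InnerText = replacement;
        }

        return doc.OuterXml;
    }
}
```

The dictionary keys are the placeholder texts as they appear to the reader (for example "<phone>", which is stored escaped in the XML), and the values are whatever should be shown instead.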
mjmarsh
This won't work unless his doc is XML.
John Saunders