I need an automated process for creating docx files from xhtml source. The xhtml files contain images (<img> elements) whose "src" attributes point to an external reference. But the docx files need to be readable without a network connection, so I need to find a way to embed the images directly into the docx package (namely, in the /media folder).

So far I've used the altChunk method (as described by Eric White) to create the .docx file. I had hoped to use the OpenXML SDK to insert the image parts into the package. But to do that I need to insert paragraphs (<p> nodes) into the document. Unfortunately the document part contains nothing but a reference to the altChunk (stored separately in the docx package). Of course, once the docx is opened, edited and saved, the altChunk part is removed and its contents are embedded properly in the document.xml. But I don't know of any way to do that programmatically, so that doesn't help.

Other options I’ve considered:

  1. Partitioning the xhtml into segments, split at each image, then adding each segment as its own altChunk, with the appropriate image reference between each one. (Tedious but seems possible)
  2. Inserting the images into the media folder, and then finding a way to embed WordProcessingML directly into the xhtml so that the <img> references the packaged image file. (Questionable at best)

Can anyone think of a better approach?
A: 

Well, I sorta solved my own problem: I decided to convert the document to mHtml (which can contain images embedded directly in the file) and then use the altChunk to create the final docx file. However, I still wanted to do some post-processing on the file (to insert endnotes in the Word document), but as mentioned above, this is not possible until after the altChunk has been transformed into docx, which cannot be done programmatically.

So it dawned on me that I could bypass the altChunk path altogether and simply use mHtml as the "gateway" from xHtml to docx. I just transformed the xHtml into mHtml, complete with embedded images and endnotes, then renamed the file with a .doc extension. The resulting document can be opened directly by Word (and will be converted properly to a native Word document on a subsequent save). So far it works great (albeit with some bugs in the Mac version of Word, as well as Word 2003).
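
For anyone wanting to try the xHtml-to-mHtml step programmatically, here is a minimal sketch in Python using only the standard library's email package. It is not necessarily how the conversion above was done; it just illustrates the general shape of an MHTML file: a multipart/related MIME message whose first part is the HTML and whose remaining parts are the images, each tagged with a Content-Location header matching the original "src" URL. The function name build_mhtml, the file:///document.htm location, and the assumption that the image bytes have already been downloaded are all hypothetical, and I haven't verified how every version of Word handles the output.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(xhtml, images):
    """Pack an XHTML string and its images into one MHTML document.

    `images` maps each <img> src URL (exactly as written in the XHTML)
    to a tuple of (raw image bytes, image subtype such as "png").
    Returns the MHTML document as bytes.
    """
    # multipart/related container; the "type" parameter names the root part.
    root = MIMEMultipart("related", type="text/html")

    # First part: the markup itself (hypothetical location for the root).
    html_part = MIMEText(xhtml, "html", "utf-8")
    html_part["Content-Location"] = "file:///document.htm"
    root.attach(html_part)

    # One part per image. Content-Location must equal the src attribute
    # so the reader can resolve the reference without a network connection.
    for src, (data, subtype) in images.items():
        image_part = MIMEImage(data, _subtype=subtype)
        image_part["Content-Location"] = src
        root.attach(image_part)

    return root.as_bytes()
```

Writing the returned bytes to a file with a .doc extension (for example open("output.doc", "wb").write(...), where output.doc is just a placeholder name) corresponds to the renaming step described above.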

kmote
I wasn't familiar with MHTML (MIME HTML), so I did a search on Google, which led me to the Wikipedia page: http://en.wikipedia.org/wiki/MHTML
Matt Passell
I meant to ask in my last comment how you converted the document to mHTML. Did you do it programmatically or using an app like Word? I'm looking to do the same overall conversion programmatically.
Matt Passell