I'm trying to convert the contents of a Microsoft Word (.doc) file into nicely formatted XHTML using C#, .NET 2.0 and the Microsoft.Office.Interop.Word namespace. This is just a little exe that I'm building, which I can hopefully integrate into our automated build process. I'm doing this because Word's built-in "Save as Web Page" does a horrible job of the HTML generation. I'm using Microsoft Word 2003.

I've looked around for resources on this, but beyond the MSDN reference (http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word%28office.11%29.aspx) and a few tutorials on how to create Word documents (rather than read them in a way that allows conversion to another format), I'm coming up blank.

So far, I have a little app that will loop through all the paragraphs in the opened Word document and wrap their text in HTML paragraph tags and output this to an HTML file. It would appear that, in Word, everything is treated as a paragraph, so I'm finding that there's no way to determine if the current paragraph is a list, table, header, etc. There are separate collections for Tables, Lists, etc, but there doesn't seem to be a way (that I've found) to derive an order from the contents of the object model that the Microsoft.Office.Interop.Word namespace provides.
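One possible way to recover the kind of each paragraph while still walking the document in order is to inspect each paragraph's own Range rather than the document-level collections. The following is only a minimal sketch under stated assumptions: it assumes a reference to the Word 2003 primary interop assembly, that `doc` is an already opened document, and that headings use Word's built-in outline levels. The class and method names (`ParagraphClassifier`, `Describe`) are hypothetical, and the code has not been verified against every document layout.

```csharp
using System;
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

public static class ParagraphClassifier
{
    // Walks the paragraphs in document order and labels each one.
    // Assumption: 'doc' was opened elsewhere via the Word application object.
    public static List<string> Describe(Word.Document doc)
    {
        List<string> result = new List<string>();

        foreach (Word.Paragraph para in doc.Paragraphs)
        {
            Word.Range range = para.Range;
            string kind;

            if (range.Tables.Count > 0)
            {
                // The paragraph's range touches at least one table.
                kind = "table";
            }
            else if (range.ListFormat.ListType != Word.WdListType.wdListNoNumbering)
            {
                // Bulleted or numbered list item.
                kind = "list";
            }
            else if (para.OutlineLevel != Word.WdOutlineLevel.wdOutlineLevelBodyText)
            {
                // Outline levels 1 to 9 are assumed to map to headings.
                kind = "heading " + (int)para.OutlineLevel;
            }
            else
            {
                kind = "paragraph";
            }

            // Range.Text ends with a paragraph mark; table cells also carry a cell marker.
            string text = range.Text.TrimEnd('\r', '\a');
            result.Add(kind + ": " + text);
        }

        return result;
    }
}
```

Because the labels come back in document order, consecutive "list" or "table" entries could be grouped into a single XHTML list or table element in a later pass.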

Firstly, is anyone aware of any good resources for using the Microsoft.Office.Interop.Word namespace to do what I'm trying to achieve?

Secondly, am I trying to reinvent the wheel here (notwithstanding my earlier explanation of why I'm not using the "Save as Web Page" feature), or barking up the wrong tree in terms of my choice of approach, technology, or code library?

I know that MS Office 2007 and beyond have increasingly better support for the Office Open XML format (http://en.wikipedia.org/wiki/Office_Open_XML), so, assuming that's any good, an XML transform may be possible.

Also, there are apparently some good products out there for doing the type of thing I'm describing, but there don't appear to be any good open source alternatives.

A: 

I hate Interop. Interop feels like a kludge because it is a kludge.

Can you open the document in Word 2007, "Save As" -> "Other Formats", pick one of the XML formats, and process the resulting XML with System.Xml? Transforming from one XML document to another XML document is going to be a lot easier than mucking about with Interop.
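As a rough illustration of the XML route, here is a minimal sketch that reads paragraphs out of a saved XML file with System.Xml and emits XHTML tags. It makes several assumptions: the file was saved in the Word 2003 XML format, the namespace URI below matches the one on the file's root element, and a paragraph style value of "Heading1" marks a top-level heading. The names `WordMlSketch` and `ParagraphsToXhtml` are hypothetical, and the sketch uses XmlDocument with XPath rather than a full XSLT transform.

```csharp
using System;
using System.Security;
using System.Text;
using System.Xml;

public static class WordMlSketch
{
    // Assumed namespace of the Word 2003 XML format; check the root element of your own file.
    const string W = "http://schemas.microsoft.com/office/word/2003/wordml";

    public static string ParagraphsToXhtml(string wordMl)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(wordMl);

        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", W);

        StringBuilder html = new StringBuilder();

        // Selects every paragraph under the body in document order,
        // including paragraphs nested in tables (not handled specially here).
        foreach (XmlNode p in doc.SelectNodes("//w:body//w:p", ns))
        {
            // A paragraph's text is spread over one or more w:t run elements.
            StringBuilder text = new StringBuilder();
            foreach (XmlNode t in p.SelectNodes(".//w:t", ns))
            {
                text.Append(t.InnerText);
            }

            // Assumption: the style name "Heading1" identifies a top-level heading.
            XmlNode style = p.SelectSingleNode("w:pPr/w:pStyle/@w:val", ns);
            string tag = (style != null && style.Value == "Heading1") ? "h1" : "p";

            html.Append("<" + tag + ">");
            html.Append(SecurityElement.Escape(text.ToString()));
            html.Append("</" + tag + ">\n");
        }

        return html.ToString();
    }
}
```

Lists and tables would need their own handling on top of this; the point is only that the XML preserves document order, so each element can be mapped as it is encountered.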

quillbreaker
Sorry, should have mentioned that I'm working with Word 2003. I think I've got the interop thing sorted, it's just a case of determining how I can bend the object model to my needs.
A. Murray