I need to convert Word binary documents (Word 97 through 2003) into HTML documents programmatically. I have googled for third-party libraries, but most results are junk built on top of System.IO.Packaging, which of course only handles Word 2007 (Office Open XML) documents and is useless for the older binary format.

Do you know a good tool or library for .NET that can convert them programmatically?

Later edit 1: COM Interop with the Word application is out of the question because it doesn't scale AT ALL. I have tried generating XLS files that way, and it takes seconds per small file even on a multi-core server. It also requires Office to be installed on the server, and modal dialogs can pop up on the server during conversion, which would hang every Office-related process.

A: 

I don't know of any conversion libraries specifically, but you could try using Word's COM interface, from any language that can talk to COM, to "parse" the document and extract the text and formatting. It doesn't even have to be C#. I'd use something like Ruby because it's quick to develop in and has reasonable COM interaction capabilities.

There should also be a way of using COM to activate Word's own "Save as HTML" option for the document.


I found this documentation, which provides essentially what you need. Just use the save method on that object and specify an HTML variant as the format.
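As a rough illustration of the "save as HTML" idea, here is a minimal Python sketch. It assumes Windows, an installed copy of Word, and the pywin32 package. The helper names `save_as_html` and `convert_with_word` are made up for this example. The numeric constants are Word's `WdSaveFormat` values for HTML (8) and filtered HTML (10). The same calls should be reachable from any COM-capable language, including C# or Ruby.

```python
import os

# Values from Word's WdSaveFormat enumeration.
WD_FORMAT_HTML = 8
WD_FORMAT_FILTERED_HTML = 10


def save_as_html(word_app, doc_path, html_path,
                 file_format=WD_FORMAT_FILTERED_HTML):
    """Open doc_path in the given Word application object and save it as HTML.

    word_app is expected to behave like a Word.Application COM object.
    """
    doc = word_app.Documents.Open(os.path.abspath(doc_path), ReadOnly=True)
    try:
        doc.SaveAs(os.path.abspath(html_path), FileFormat=file_format)
    finally:
        # Close without saving any changes back to the source document.
        doc.Close(False)


def convert_with_word(doc_path, html_path):
    """Hypothetical convenience wrapper: Windows only, needs Word and pywin32."""
    import win32com.client  # imported lazily so this module loads anywhere

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        save_as_html(word, doc_path, html_path)
    finally:
        word.Quit()
```

Something like `convert_with_word("report.doc", "report.html")` would drive a real Word instance, so it is only suitable for interactive, desktop use.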


Apart from using COM, there's a Java-based library, POI, that interacts with Microsoft formats. It's not .NET, but it's the other thing (besides COM) that we use at work to interact with Excel files.

Daemin
A: 

COM Interop with the Word application is out of the question because it doesn't scale AT ALL. I have tried generating XLS files that way, and it takes seconds per small file even on a multi-core server. It also requires Office to be installed on the server, and modal dialogs can pop up on the server during conversion, which would hang every Office-related process.

Andrei Rinea
You could have mentioned the server requirement in your question.
Daemin
Ooops.. :"> I'll add it now..
Andrei Rinea
+5  A: 

Be careful when you consider using Office client automation: not only does it scale poorly, but you can also run into a lot of problems with memory, macros, pop-ups...

From http://support.microsoft.com/kb/257757:

Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment.

Other ways are:

  • OpenXML (which I recommend if you can switch to Office 2007 documents)
  • Using third party libs (like Aspose)
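If you end up with a mix of old and new documents, a small check of the file header can route each one to the right converter. The sketch below is only an illustration. The function names are invented, and it identifies the container rather than the application: OLE2 compound files (used by Word 97-2003, but also by old Excel and PowerPoint files) start with the bytes D0 CF 11 E0 A1 B1 1A E1, while Office Open XML files are ZIP archives starting with "PK".

```python
# Magic numbers of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Word 97-2003 and other OLE2 files
ZIP_MAGIC = b"PK\x03\x04"                       # Office Open XML and any other ZIP


def sniff_container(header: bytes) -> str:
    """Classify a file by its first bytes: 'ole2', 'zip' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file to classify its container."""
    with open(path, "rb") as f:
        return sniff_container(f.read(8))
```

A file reported as "ole2" would go to a binary-format converter (a third-party library, for example), while "zip" files could be handled with the OpenXML tooling.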
Nico
Thanks, I went with Aspose... It works great!
Andrei Rinea
+1  A: 

I have come across wv, a C library that reads some Microsoft Word formats (and is used for document conversion in AbiWord). Since it is written in C, you could maybe run it as a daemon, or wrap it in COM and use interop. It looks like it would need at least a little work before it was useful to you.

You could also look at whether OpenOffice.org has a separate converter component.
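For what it's worth, here is a hedged sketch of how that could look with a headless office suite. It assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless --convert-to html --outdir`. Older OpenOffice.org builds may use different switches, so check your version. The function names are made up for the example.

```python
import subprocess


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the command line for a headless .doc -> HTML conversion."""
    return [
        soffice,
        "--headless",            # no GUI, so no modal dialogs
        "--convert-to", "html",  # target format
        "--outdir", out_dir,     # where the .html file is written
        doc_path,
    ]


def convert_to_html(doc_path, out_dir, timeout=60):
    """Run the conversion; raises if soffice fails or exceeds the timeout."""
    subprocess.run(build_convert_command(doc_path, out_dir),
                   check=True, timeout=timeout)
```

The timeout is there so that a stuck conversion fails with an exception instead of hanging the calling service.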

pdc
A: 

Thank you, I went for Aspose.Words. It works like a charm :)

Andrei Rinea