I need to parse report files generated in an old version of the SpreadsheetML format (see the HTML header of the file below) so that I can merge multiple single-page reports into a single tabbed workbook with all formatting and contents intact.

<HTML xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="-//W3C//DTD HTML 4.0//EN">

This has proven extremely difficult because the documents are not well-formed XML, and none of the third-party tools I have found (including Syncfusion and SpreadsheetGear) support the format. In addition, I cannot use Excel itself, because this needs to run unattended as a Windows service.

I have had limited success with a custom parsing solution that combines regular expressions with the HTML Agility Pack and uses the Syncfusion XlsIO component to build the output document. This approach does not perform well enough and will be difficult to maintain.
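To make the tolerant-parsing part concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not my actual implementation, and the names `TableExtractor` and `extract_rows` are hypothetical. It assumes the report data sits in plain, non-nested `<table>` markup and it collects cell text only, so it shows how unclosed tags and unquoted attributes can be tolerated but does nothing about preserving formatting.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects cell text from table rows without requiring well-formed XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs (style, class, x:num, ...) are ignored here.
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()  # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags, etc.) is dropped.
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def close(self):
        super().close()
        self._close_row()  # flush a row left open at end of input


def extract_rows(html_text):
    """Return the table contents of html_text as a list of rows of strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Calling `extract_rows` on the text of one report returns a list of rows, each a list of cell strings. Carrying formatting across would additionally require reading the `attrs` passed to `handle_starttag` and the document's style block, which this sketch leaves out.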

Is there a more standard way to read these documents without invoking Excel? Any help is appreciated.

Michael