views:

974

answers:

2

Our client gets Excel files in Excel's HTML format.(That's the way it arrives; nobody has any control over that.) We then need to run a report based on the data in the file. Using ADO.NET (OleDbReader) results in a "External file not in expected format" exception.

If the data is converted to regular Excel format it gets read in OK. However this is not really a solution as it imposes an extra step they must take and they are not too computer-literate at the best of times.

The only solution I could come up with was to use Excel Automation to create a new spreadsheet, fill it with the same data, and read that one instead. But ADO.NET seem to only be able read from a file on disk. I could of course save the file and delete it when I'm done with it (which I've verified will work). But I'm uncomfortable with the idea of messing around with their filesystem. So my first question is - is there a way to fill a DataTable from an in-memory Excel Worksheet?

Additionally, I don't like the whole business with using Automation; it's incredibly slow. The operation takes over 30 seconds even without filling the DataTable. So a solution that makes it any slower is not going to be any good. That brings me to my second question - Is there a better way to accomplish what I'm trying here?

A: 

I'm not sure what you mean by "Excel's HTML format". Recent versions of Excel have an XML file format, and I Excel can open an HTML file containing a table and convert it to a worksheet, but Í don't know of any specific Excel HTML format.

Regarding the solution using Excel Automation, once you have a Worksheet in memory, you can get the values into a 2-dimensional array of objects using the Value2 property, then use that to build a DataTable. I don't think doing so will add much additional overhead on top of the initial overhead of using Automation (which needs to create an Excel process).

Is there a better way? Parsing arbitrary HTML is not trivial, but if the files you receive have a consistent format it may be possible to parse them.

Joe
+1  A: 

Try the HTML Agility Pack: http://www.codeplex.com/htmlagilitypack

I use it in a similar scenario. In my case:...

  • someone pasted a table from excel into the clipboard
  • retrieve the HTML text
  • use HTML Agility back to find the TABLE, TR, TH, TD tags
  • and then construct the DataTable from it

At no time in my case is the HTML persisted to disk

namenlos