Excel file parsing/scraping using .NET

views:

answers:

+2 Q:

Excel file parsing/scraping using .NET

Hi experts am trying to parse an excel file. its structure is very complex. The possible way i know are.

Use Office introp libraries
Use OLEDB provider and read excel file in a dataset.

But the issue is of its complexity like some columns,cells or rows blank etc.

What are the best possible ways to do this ?

thanks in advance.

I have my users first save the Excel spreadsheet as a CSV file. Then they upload the CSV file to my app. That makes it much simpler to parse.

David 2010-07-13 18:44:11

+1 A:

Personally i prefer to either use the OLEDB way, which is a bit clunky at best at times, or you can use a third party library that has put in the time/effort/energy to get access to the data.

SyncFusion has a pretty nice library for this.

Mitchel Sellers 2010-07-13 18:44:14

I've used OLEDB myself to read uploaded Excel files, and its presents no real problems (except for nulls in fields, instead of blanks, which can be checked with IsDBNull). Also, third party open source tools like NPOI and Excel2007ReadWrite (http://www.codeproject.com/KB/office/OpenXML.aspx) can be useful.

I have thoroughly evaluated both of these third party tools, and both are pretty stable and easy to integrate. I would recommend NPOI for Excel 2003 files, and Excel2007ReadWrite for Excel 2007 files.

NJITBEN 2010-07-13 19:18:46

+1 A:

I can recommend the ExcelDataReader (licensed under LGPL I think). It loads both .xls and .xlsx files, and lets you get the spreadsheet as a DataSet, with each worksheet being an individual DataTable. As far as I know from the scenarios I have used it in, it honours blank rows, empty cells, etc. Try it and see if you think it will handle your "very complex" structure. [I do notice one negative review on the site - but the rest are pretty positive. I've experienced an issue reading .xlsx if a worksheet is renamed]

I've also used the OLEDB approach in the past, but be warned that this has real problems in the way it tries to infer datatypes in the first few rows. If the datatype changes for a column, then this may well infer it wrongly. To make matters worse, when it does get it wrong, it will often return null as the value, making it difficult (or impossible) to tell a true null value from a datatype that changed after the first six or seven rows.

Rob Levine 2010-07-13 19:25:08

It sounds like you have a good understanding of the task at hand. You'll have to write business logic to untangle the complexities of the spreadsheet format and extract the data you're looking for.

It seems to me that VTSO/Interop is the best platform strategy for 2 reasons:

Access to the spreadsheet data will be a small part of the effort needed for your solution. So if using OLEDB saves a little time in data access, it will probably be irrelevant in terms of the overall project scope.
You may need to examine the contents of individual cells closely and take context information like formatting into account. With interop, you get full visibility of cell contents, context, and other sheet level context information like named ranges and lists. It is a risk to assume you won't need this type of information while decoding the spreadsheet.

Paul Keister 2010-07-13 19:26:20

ansaurus

tags:

views:

answers:

Excel file parsing/scraping using .NET

related questions