I am currently working on a C# project that traverses an Excel document and inserts its data into a database.

The relevant data for this project is:

  • The Excel sheet has 14 rows at the top that I do not care about (sometimes 15; see Russia/Siberia below).
  • The data is grouped by name into 2 columns (date and value), such as:

Sheet 1

USA                        China                      Russia  
Date         Value         Date         Value         Siberia           
1/1/09       4.3654        1/1/09       2.7456        Date          Value        
1/2/09       3.5545        1/3/09       9.3214        2/5/09        0.2454
1/3/09       3.2322        1/21/09      5.2234        2/6/09        0.5557
  • The name I need to acquire is whichever is listed directly above "Date".
  • I only care about data from dates we do not have in the database. Before each column set is parsed, I will acquire the max date for any given name from the database, and skip anything at or before it.
  • There is no guarantee that the columns will be in a constant order or have constant spacing.
  • I do not want data for all names, rather only those in a list I put together before the file is acquired.

My current plan is this:

  • For each column, if the date field is at row 16, save the name as the value in row 15 above it, check the database for the last date for that name, only insert data where the date is greater than the acquired date.
  • If the date field is at row 17, do the same thing, but start the for loop through each row at 18.
  • If the name is not in the list, skip the column. If it is, make sure to grab the column next to it for the necessary values.

My problem is:

  • I am currently trying to use the ExcelDataReader from CodePlex (http://www.codeplex.com/ExcelDataReader). It only likes CSV-like sheets, which this one is not.
  • I do not know of any alternative Excel readers.
  • To the best of my knowledge, a straight FileStream traversal of this file can only go row-by-row, rather than column-by-column.

To anyone still reading, thank you for your time. Any recommendations on how to proceed? Please ensure that solutions can traverse each column, not each row.

Also, please don't worry about the database stuff, or the list of names that precedes the traversal.

Addendum: What I'd really like to end up with is some type of table that I can just traverse with a nested loop, making column-centric traversal much, much easier. Because there is so much garbage near the top of the sheet (14+ rows), most simple solutions are not feasible.
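
A minimal sketch of the column-centric scan described above, assuming the sheet has already been loaded into a `string[,]` grid by whichever reader ends up being used. The names `ColumnScanner`, `Reading`, `Scan`, `wantedNames`, and `lastStoredDate` are hypothetical and only illustrate the plan. Instead of hard-coding rows 16 and 17, it looks for a "Date" cell with "Value" to its right and takes the name from the cell directly above "Date", so the number of garbage rows does not matter.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ColumnScanner
{
    // One (name, date, value) observation pulled out of the sheet.
    public sealed class Reading
    {
        public string Name;
        public DateTime Date;
        public double Value;
    }

    private static bool IsHeader(string cell, string expected)
    {
        return string.Equals((cell ?? "").Trim(), expected, StringComparison.OrdinalIgnoreCase);
    }

    // grid[row, col] holds the cell text (null or "" for empty cells).
    // lastStoredDate stands in for the database lookup of the max date per name.
    public static List<Reading> Scan(
        string[,] grid,
        ICollection<string> wantedNames,
        Func<string, DateTime> lastStoredDate)
    {
        var readings = new List<Reading>();
        int rows = grid.GetLength(0), cols = grid.GetLength(1);
        CultureInfo us = CultureInfo.GetCultureInfo("en-US"); // assumes M/d/yy dates

        for (int c = 0; c + 1 < cols; c++)       // outer loop: columns
        {
            for (int r = 1; r < rows; r++)       // inner loop: find the header row
            {
                if (!IsHeader(grid[r, c], "Date") || !IsHeader(grid[r, c + 1], "Value"))
                    continue;

                // The name is whatever sits directly above "Date".
                string name = (grid[r - 1, c] ?? "").Trim();
                if (!wantedNames.Contains(name))
                    break;                       // not in the list: skip this column

                DateTime cutoff = lastStoredDate(name);
                for (int d = r + 1; d < rows; d++)
                {
                    DateTime date;
                    double value;
                    if (!DateTime.TryParse(grid[d, c], us, DateTimeStyles.None, out date))
                        break;                   // end of this column's data
                    if (!double.TryParse(grid[d, c + 1], NumberStyles.Float,
                                         CultureInfo.InvariantCulture, out value))
                        break;
                    if (date > cutoff)           // keep only dates newer than the database
                        readings.Add(new Reading { Name = name, Date = date, Value = value });
                }
                break;                           // one data block per column
            }
        }
        return readings;
    }
}
```

The sketch stops reading a column at the first cell that does not parse as a date, which assumes there are no blank rows inside a data block.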

A: 

I highly recommend saving this Excel document in CSV format before doing anything else with it. You can do that using this code. Once you have a CSV, you can either parse it using that library or write your own parser for it.

Hamish Grubijan
A: 

Not a straight answer to your question but an alternative idea:

Your data looks like a pivot-ish table. I'd recommend "unpivoting" it into a simple table.

Example:

           Russia      USA 
Q1            123       323
Q2            456       321
Q3            567       843

Becomes:

Quarter Country  Value
Q1      Russia    123       
Q1      USA       323 
Q2      Russia    456
....

If that is the case (I'm not sure I understood your question correctly), then processing the data using an OLE DB driver or any CSV-style tool should become much less painful.
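
A minimal sketch of the unpivot step, assuming the wide table is already in memory as row keys, column names, and a 2D array of values. `Unpivot.ToLong` and its parameter names are hypothetical, made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class Unpivot
{
    // Turns a wide table (one column per country) into long
    // (row key, column name, value) rows, row by row.
    public static List<Tuple<string, string, double>> ToLong(
        string[] rowKeys, string[] columnNames, double[,] values)
    {
        var result = new List<Tuple<string, string, double>>();
        for (int r = 0; r < rowKeys.Length; r++)
        {
            for (int c = 0; c < columnNames.Length; c++)
            {
                result.Add(Tuple.Create(rowKeys[r], columnNames[c], values[r, c]));
            }
        }
        return result;
    }
}
```

Called with the keys `Q1`, `Q2`, `Q3`, the columns `Russia`, `USA`, and the values from the first table, this yields six rows starting with (`Q1`, `Russia`, 123) and (`Q1`, `USA`, 323).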

Alex
A: 

You can access Excel directly using ADO.NET via the ODBC driver. See http://www.davidhayden.com/blog/dave/archive/2006/05/26/2973.aspx or Google for more info on how to do that. You may wish to try HDR=No in your connection string, since your first row isn't really proper headers by the looks of it.

I haven't done this for a while, but I remember that it is a bit "temperamental" and takes some playing around with to get the column names right, but it should work. Try SELECT * FROM [Sheet1$] and see what you get.
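
A rough sketch of that approach. Note that it uses the Jet OLE DB provider through `System.Data.OleDb` rather than ODBC, which is my assumption about the simplest way to pass `HDR=No`. `ExcelLoader`, `BuildConnectionString`, and `LoadSheet` are hypothetical names, and `IMEX=1` is an extra setting, not part of the suggestion above, that asks the driver to treat mixed-type columns as text. It assumes an `.xls` file and a machine with the Jet provider installed.

```csharp
using System.Data;
using System.Data.OleDb;

public static class ExcelLoader
{
    // HDR=No tells the provider that the first row is data, not column headers.
    public static string BuildConnectionString(string path)
    {
        return "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path +
               ";Extended Properties=\"Excel 8.0;HDR=No;IMEX=1\"";
    }

    // Loads a whole worksheet into a DataTable that can be walked by column index.
    public static DataTable LoadSheet(string path, string sheetName)
    {
        var table = new DataTable();
        using (var connection = new OleDbConnection(BuildConnectionString(path)))
        using (var adapter = new OleDbDataAdapter(
            "SELECT * FROM [" + sheetName + "$]", connection))
        {
            adapter.Fill(table); // Fill opens and closes the connection itself
        }
        return table;
    }
}
```

With `HDR=No` the provider names the columns `F1`, `F2`, and so on, so the resulting `DataTable` can be traversed with a nested loop over `table.Columns` and `table.Rows`, garbage rows included.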

Evgeny
But assuming that there are many rows above the name that are garbage, would this method still work?
Norla
It might. You'd have to see what Excel returns to you and figure out how to detect what's garbage and what's not.
Evgeny
A: 

I prefer to use an OLE DB connection to connect to an Excel document, as I have done before.

By the way, you can take a look at the following article for more information: http://www.codeproject.com/KB/office/excel%5Fusing%5Foledb.aspx

Ramezanpour
From the link: "How does it happen? Apparently, the engine reads the first 8 cells of each column and check it's data type. if most of the first 8 cells are int / double, the problem remains." I have a minimum of 14 rows before I get to real data. And yes, these do have other data (garbage) scattered throughout them.
Norla
As far as I know, there's no problem with the solution above. Maybe I don't understand your problem, but if it is about reading data from an Excel document, OLE DB is a solution :-)
Ramezanpour
+2  A: 

If you want to read from Excel in C#, I've used this library with great success. It'll give you the flexibility to parse columns and rows however you'd like:

Other open source libraries I haven't used but could be good:

Alternatively, you can use one of the many good Java libraries, and convert it into a C# assembly using IKVM:

I've covered how to do the IKVM Java -> C# conversion here (it's really not as horrible an option as you think):

Chris
I look forward to trying koogra and nexcel. I'll let you know how they work out for me.
Norla
I've been using koogra for more than a year in a production system that parses half a dozen Excel files daily. It really works quite well, but its documentation is nonexistent. Get in touch if you get stuck with it.
Chris
+1 Chris, POI via IKVM isn't such an unwise option as it first appears. It's a very robust library.
Mark Nold
POI.NET is definitely dead. NPOI is very much alive and robust. Definitely a better option than POI + IKVM
Nate
A: 

SpreadsheetGear for .NET can load workbooks and access any cells on any sheet in any order. You can get the formatted text of the cell (such as "1/1/09") or the underlying value ("1/1/09" is stored as the double 39814.0 in Excel or SpreadsheetGear).
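
The date-as-double point applies to any reader that hands back raw cell values. As a small illustration using plain .NET (this is not SpreadsheetGear's API; `SerialDates` and `FromSerial` are hypothetical names), `DateTime.FromOADate` converts such a serial number back to a date:

```csharp
using System;
using System.Globalization;

public static class SerialDates
{
    // Converts an Excel-style serial date (days since 1899-12-30) to a DateTime.
    public static DateTime FromSerial(double serial)
    {
        return DateTime.FromOADate(serial);
    }

    public static void Main()
    {
        DateTime date = FromSerial(39814.0);
        // prints "1/1/09"
        Console.WriteLine(date.ToString("M/d/yy", CultureInfo.InvariantCulture));
    }
}
```

Comparing these `DateTime` values against the max date from the database avoids parsing formatted text such as "1/1/09" at all.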

You can see some live ASP.NET samples here and download the free trial here if you want to try it yourself.

Disclaimer: I own SpreadsheetGear LLC

Joe Erickson
Not gonna lie. I wish I had a disclaimer like that. I am, however, looking for something more... free-ish.
Norla