views: 503
answers: 4
For whatever reason, I have a lot of clients that have existing data that's stored in spreadsheets. Often there are hundreds, if not thousands, of items in each spreadsheet, and asking the client to manually enter them through a website (or, heaven forbid, importing them that way myself) is out of the question. Typically, this data doesn't simply map spreadsheet column to database column. That would be too easy. Often, the data needs to be manipulated before going into the database (data needs to be split by commas, etc.) or the data needs to be spread out across multiple tables. Or both.

I'm asking this question, not because I don't know of a multitude of ways to do it, but because I haven't settled on a way that doesn't feel like it takes more work than it should. So far I've taken all of the following approaches (and probably more that I've forgotten):

  • Using Excel to modify the data, so it's a little bit easier to import
  • Importing the entire spreadsheet into a temporary table and then importing with SQL
  • Writing a script and importing the data with it (I've used VBScript, C# and now Ruby)

So far, using a script has been the way that seemed most flexible, but it still feels a little clunky. I have to perform this task enough that I've even contemplated writing a little DSL for it, just to speed things up.

But before I do that, I'm curious, is there a better way?

+2  A: 

yeah.. that just sucks.

I would go with the script. And I assume you have repeating columns that have to match a single row in another table. I would do reasonable matching, and if you encounter a row that the script can't deal with when moving the data, then log it and have someone handle it manually.

Arthur Thomas
I do have repeating columns like you describe, and reasonable matching is pretty much what I've told clients that I can do. Usually, if they can't provide some kind of unique id to match on, they understand that things can't always match up perfectly.
Matt Ephraim
+2  A: 

It's the little details that'll kill you on this, of course, but in general, I've had success with exporting the data as CSV from Excel, then reading it using a tool or script, munging it as needed, and inserting it. Depending on the wonderfulness of my environment, that can be done with a database interface to the scripting language, down to and including writing SQL INSERT statements into a script file.

There are good CSV packages available for Python, Ruby, and Perl.

Charlie Martin
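
For illustration, a minimal Ruby sketch of the CSV-export approach above; the file name, column headers, and staging tables are made up for the example. It reads the exported CSV, splits a comma-separated cell into child rows, and writes the INSERT statements into a .sql file.

    require 'csv'

    # Read the CSV exported from Excel; the first row is assumed to be headers.
    File.open('import.sql', 'w') do |sql|
      CSV.foreach('clients.csv', headers: true) do |row|
        name = row['Client Name'].to_s.gsub("'", "''")   # naive escaping, fine for a sketch
        sql.puts "INSERT INTO staging_clients (name) VALUES ('#{name}');"

        # Munging example: one comma-separated cell becomes several child rows.
        row['Tags'].to_s.split(',').map(&:strip).each do |tag|
          sql.puts "INSERT INTO staging_tags (client_name, tag) VALUES ('#{name}', '#{tag.gsub("'", "''")}');"
        end
      end
    end

With a real database interface (the sqlite3 or pg gems, for example) you'd use parameterized inserts instead of string interpolation, but the shape of the script stays the same.
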
Ruby actually has a library called (oddly enough) Spreadsheet that has worked pretty well for importing directly from Excel as well. I had problems with the ParseExcel library (which is a Perl port I think), but Spreadsheet builds off ParseExcel and seems to have fixed a lot of the issues.
Matt Ephraim
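
For reference, reading an .xls workbook directly with the Spreadsheet gem looks roughly like this (a sketch; the file name and column positions are assumptions):

    require 'spreadsheet'

    book  = Spreadsheet.open('clients.xls')
    sheet = book.worksheet(0)   # first worksheet in the workbook

    # Skip the header row, then pull values out of each data row.
    sheet.each(1) do |row|
      name, tags = row[0], row[1]
      puts "#{name}: #{tags}"
    end
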
+2  A: 

You have to set boundaries, if you can. You should try to provide a template for them to use with the expected data, which includes file type (Excel, CSV, etc.), column names, valid values, etc. You should allow the user to browse for the file and upload it on your page/form.

Once the file is uploaded, you need to do validation and import. You can use ADO.NET, file streams, DTS/SSIS, or Office Automation to do this (if you are using the Microsoft stack). In the validation portion, you should tell the user exactly what they did wrong or need to change. This might include having the validation page show the actual data in a datagrid and provide red error labels on the exact row/column. If you use Office Automation, you can give them the exact cell number, but the Office PIA is a pain in the neck.

Once validation is accepted, you can import the information however you like. I prefer putting it into a staging table and using a stored proc to load it, but that's just me. Some prefer to use the object model, but this can be very slow if you have a lot of data.

If you are personally loading these files manually and having to go in and manipulate them, I would suggest finding the commonality among them and coming up with a standard to follow. Once you have that, you can make it so the user can do it themselves, or you can do it a lot faster yourself.

Yes, this is a lot of work, but in the long run, when there is a program that works 95% of the time, everybody wins.

If this is going to be a situation which just can't be automated, then you will probably just have to have a vanilla staging table and have SQL do the import. You will have to load the data into one staging table, do the basic manipulation, and then load it into the staging table that your SQL expects.

I've done so many imports and used so many ETL tools, and there really is no easy way to handle it. The only way is to come up with a standard that is reasonable, stick to it, and program around that.

Charles Graham
Most clients are pretty good about following a standard if the imports are going to happen frequently. The biggest problems come with importing data that only needs to get imported once (prior to a site launch or when new features are being added etc).
Matt Ephraim
That seems like a one-off to me. And those are the times when we hate our jobs.
Charles Graham
I've used MSSQL's DTS for this in the past and it has worked well. You should have the DTS package write to a log in the database so you know when it succeeds or fails.
Diodeus
A: 

A DSL is the way to go.

Create a domain model for your problem. You talk about cells, columns, rows, database tables, splitting fields, combining fields, and mapping from cells to database columns, so those are the concepts you need. In addition, you probably want ranges (of cells) and sheets.

A simple view looks only at the values in the spreadsheets, not the underlying formulas. Exporting the spreadsheet as tab-separated text gives you access to that. If you need access to the formulas, you're better off with the XML representation, either the XML Spreadsheet format or the Office XML format.

You might be able to come up with a DSL in Excel. That could allow your smarter users to do (part of) the mapping.

Stephan Eggermont
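
To make the DSL idea concrete, here's a rough sketch of what a little Ruby mapping DSL could look like. Everything in it is hypothetical (the Importer class, the map method, the column and table names) and exists only to show the shape of the domain model: columns map onto tables/columns, and a split rule fans one cell out into multiple rows.

    require 'csv'

    # A hypothetical mapping DSL: declare how spreadsheet columns map onto
    # database tables/columns, then run the mapping against a CSV export.
    class Importer
      def initialize(&block)
        @mappings = []
        instance_eval(&block)
      end

      # e.g. map 'Tags', to: [:client_tags, :tag], split: ','
      def map(source, to:, split: nil)
        table, column = to
        @mappings << { source: source, table: table, column: column, split: split }
      end

      def run(csv_path)
        CSV.foreach(csv_path, headers: true) do |row|
          @mappings.each do |m|
            raw    = row[m[:source]].to_s
            values = m[:split] ? raw.split(m[:split]).map(&:strip) : [raw]
            values.each do |value|
              escaped = value.gsub("'", "''")   # naive escaping, fine for a sketch
              puts "INSERT INTO #{m[:table]} (#{m[:column]}) VALUES ('#{escaped}');"
            end
          end
        end
      end
    end

    # Usage: the column headers and file name are made up for the example.
    importer = Importer.new do
      map 'Client Name', to: [:clients, :name]
      map 'Tags',        to: [:client_tags, :tag], split: ','
    end

    importer.run('clients.csv')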