views: 74
answers: 5

I am writing a stored procedure to handle the populating of a database from CSV files. There are about 150 tables, and there will be one CSV file per table.

There is some added complication in that data needs to be validated against a validation table on import.

I am currently writing a single stored procedure that will take the CSV file name, table name and validation table name as parameters and then handle the import from any CSV file into any table. Obviously this saves having 150 stored procedures, one for each table.

But my question is, is this the right way to do this? The generic procedure involves quite a bit of dynamic SQL, and from what I have been reading, writing stored procedures where you don't know until run time which tables you will be using, what their schemas are, and so on is a bad thing. It will also require the use of a global temporary table. But then again, writing 150 stored procedures that all do a similar thing is not a great solution either, and it is only going to confuse the person doing the import when they have to know which stored procedure to call for which table.
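
Roughly, the generic procedure I have in mind looks something like the sketch below. All names are placeholders and the validation is simplified to a single lookup column; the real thing would also have to create ##staging to match the target table, which is what forces the global temp table and the dynamic SQL:

    -- Simplified sketch only: names are placeholders, error handling omitted.
    CREATE PROCEDURE dbo.usp_GenericCsvImport
        @CsvPath         NVARCHAR(260),   -- full path to the CSV file
        @TargetTable     SYSNAME,         -- table to load into
        @ValidationTable SYSNAME          -- table holding the allowed values
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @sql NVARCHAR(MAX);

        -- ##staging is assumed to have been created beforehand with the same
        -- columns as the target table (this is where the global temp table
        -- and the dynamic DDL come in).
        SET @sql = N'BULK INSERT ##staging
                     FROM ''' + REPLACE(@CsvPath, '''', '''''') + N'''
                     WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2);';
        EXEC sys.sp_executesql @sql;

        -- Insert only rows whose key exists in the validation table.
        -- Assumes both tables expose a column called Code, purely for illustration.
        SET @sql = N'INSERT INTO ' + QUOTENAME(@TargetTable) + N'
                     SELECT s.* FROM ##staging AS s
                     WHERE EXISTS (SELECT 1 FROM ' + QUOTENAME(@ValidationTable) + N' AS v
                                   WHERE v.Code = s.Code);';
        EXEC sys.sp_executesql @sql;
    END;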

This import will be a one-time thing, so performance is not so much an issue, and the only people running it will have full permissions on the DB, so security is not much of an issue either. I'd like to hear your thoughts on what the best method is here.

+4  A: 

The correct way to do this in SQL Server is an SSIS package; it has all sorts of ETL tools built in. But as this is only a one-time thing, I would probably do it in whatever way I knew I could code fast and efficiently. I would write a bit of .NET. In your case it might be a sproc.

Chris Diver
I would second the use of SSIS. There is a nice 'for-each over *.csv in a directory' component you can use such that you can make a lot of it generic.
Joel Mansford
+1  A: 

"Obviously this saves having 150 store procedures, one for each table"

I would dispute "saves". There's no "saving" here. You've made 150 simple procedures into one complex procedure. The complexity of the one procedure outweighs any "cost" of multiple simple procedures.

"so performance is not so much an issue"

This is always true with stored procedures. They're slow. Do some measurements. You'll find that Java or C# will be just as fast, and often faster than a stored procedure.

"But then again writing 150 stored procedures that do a similar thing is also not a great solution,"

Why?

You still have ordinary software design -- common code can still be factored out into procedures or functions that the 150 load procedures call. Each table's procedure does not have to be an unthinking copy-and-paste of the load process. You should be able to reduce each stored procedure to just its minimal unique processing: validation and insert.
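
As a rough illustration of that split (all names here are invented and the two helper procedures are hypothetical), each per-table procedure can shrink to little more than a couple of helper calls plus its own validation and insert:

    -- Illustrative sketch: dbo.StageCsvFile and dbo.ReportRejectedRows are
    -- hypothetical shared helpers that do the generic work once.
    CREATE PROCEDURE dbo.Load_Customers
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Shared helper: bulk-load the file into its staging table.
        EXEC dbo.StageCsvFile @CsvPath = N'C:\import\customers.csv',
                              @StagingTable = N'stg.Customers';

        -- Unique part: validation and insert for this one table.
        INSERT INTO dbo.Customers (CustomerCode, Name, CountryCode)
        SELECT s.CustomerCode, s.Name, s.CountryCode
        FROM stg.Customers AS s
        JOIN dbo.ValidCountries AS v ON v.CountryCode = s.CountryCode;

        -- Shared helper: common rejected-row reporting and cleanup.
        EXEC dbo.ReportRejectedRows @StagingTable = N'stg.Customers',
                                    @TargetTable  = N'dbo.Customers';
    END;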

"This import will be a one time thing"

Famous last words. That's never true. When you do software upgrades and the database structure changes, folks will suggest dumping the old data into CSVs and using those stored procedures -- slightly modified -- to load the new database. You'll be using these regularly.

Nothing is a "one-time thing".

S.Lott
`delete from customers` in production is a one time thing :-)
Adam
I disagree that SQL Server stored procedures are slow. If they're slow for you, slower than SQL submitted directly as a batch, it likely means that the first query you made used atypical parameters and caused a bad query plan to be cached. If the selectivity of the parameters is going to be very different, you should create it WITH RECOMPILE, but you lose the benefits of a cached query plan. A heavily-branching procedure that doesn't process many rows probably is better off in a compiled language, though.
Mike Dimmick
@Mike Dimmick: Since these all read external CSV files, and have branching logic, a *measurement* is essential to determine speed. Every time I've measured SP's, they're slower than Java doing the same logic.
S.Lott
@Adam: Actually, it isn't a one-time thing. As the database structure changes, this *does* happen more than once. An "export, process off-line, delete and reload" *does* happen more than once.
S.Lott
@S.Lott of course, I was joking - but thanks for the clarification :-)
Adam
OK, I agree that this will be one complicated SP rather than 150 relatively simple ones, but it seems odd to me to write 150 stored procedures that are essentially doing the same thing, just on different tables. That aside, the big issue is that when the IT team comes to import the data, they will need to understand which SP maps to which table, and with 150 of them this could become very confusing.
Sam Cogan
@S.Lott I think the point was that it is a one time thing because you will get fired for executing that SQL in production. I prefer `TRUNCATE TABLE customers` myself :D
Chris Diver
@Chris lol, yes, when attempting to clear all customer tables, make sure it's a non-transactional command!
Adam
@Sam Cogan: "they will need to understand what SP maps to what table". How can that be "very confusing"? You have the name of the SP as a hint. SP's have comments, that's a hint. You can create a list of SP and table, that's a hint. You have specifications, that's a good source of information. What's "confusing"? Are you saying that your IT team are too dumb to look at the names, comments, source code or specifications? How dumb are they? I'm sure they do NOT want to maintain the super-load-everything-from-one-procedure SP.
S.Lott
A: 

I'd tend to use bcp to Bulk Import into the tables, if they're a direct mapping. Even if not, we often use bulk import into 'raw' tables and then run a procedure to process the imported data into the main tables.
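
For example, the staging pattern can be as simple as the sketch below (BULK INSERT shown in place of the bcp command line, and the file, schema and column names are made up):

    -- Load the raw CSV into a staging table that mirrors the file layout.
    BULK INSERT staging.Customers
    FROM 'C:\import\customers.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

    -- Then move only the validated rows into the real table.
    INSERT INTO dbo.Customers (CustomerCode, Name, CountryCode)
    SELECT s.CustomerCode, s.Name, s.CountryCode
    FROM staging.Customers AS s
    JOIN dbo.ValidCountries AS v ON v.CountryCode = s.CountryCode;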

If the incoming data isn't easy to process with bcp - although it can handle CSV fine - my next choice would be a .NET SqlBulkCopy object.

A stored procedure does give you a security boundary - you can set permissions on it so only authorized users can run it - but it doesn't give a huge performance benefit if it's not run very often. Indeed it can slow performance overall if SQL Server winds up caching the query plan at the expense of dynamic queries, and memory is tight enough to cause those dynamic query plans to be discarded.
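
That permission boundary might look something like this (the role and object names are purely illustrative):

    -- The import role may run the procedure but has no direct write access
    -- to the underlying table; ownership chaining lets the procedure insert.
    CREATE ROLE DataImporters;
    GRANT EXECUTE ON OBJECT::dbo.Load_Customers TO DataImporters;
    DENY INSERT, UPDATE, DELETE ON dbo.Customers TO DataImporters;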

Mike Dimmick
+1  A: 

If you want a code-based solution, then I'd set the tables up in SQL in the format you expect to find in your CSV files.

I would then write code to dynamically generate the SQL based on the table structure, validating the CSV file row by row against that structure so you get some sensible feedback when it fails.

Easy. Then you have one piece of code that uses the database it is loading into as its schema basis, plus a bit of a driver to map each CSV file name to a table name in the database.
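
A rough sketch of that idea in T-SQL, using INFORMATION_SCHEMA as the schema basis (the driver table, the ##staging table and all names here are invented for illustration):

    -- Hypothetical driver table: which CSV file loads into which table.
    CREATE TABLE dbo.ImportDriver
    (
        CsvFileName NVARCHAR(260) NOT NULL,
        TargetTable SYSNAME       NOT NULL
    );

    -- Build a column list for one target table straight from its definition,
    -- so the generated INSERT always matches the current schema.
    DECLARE @TargetTable SYSNAME = N'Customers';   -- placeholder value
    DECLARE @cols NVARCHAR(MAX);

    SELECT @cols = STUFF((
        SELECT N', ' + QUOTENAME(COLUMN_NAME)
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_NAME = @TargetTable
        ORDER BY ORDINAL_POSITION
        FOR XML PATH('')), 1, 2, N'');

    DECLARE @sql NVARCHAR(MAX) =
        N'INSERT INTO ' + QUOTENAME(@TargetTable) + N' (' + @cols + N')
          SELECT ' + @cols + N' FROM ##staging;';  -- ##staging holds the raw CSV rows

    EXEC sys.sp_executesql @sql;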

Maintaining 150 stored procedures sounds like a total ball ache that is more trouble than it's worth. Refactoring is supposed to make life more pleasant, not turn it into a maintenance headache.

This is all simple and easily extensible, too.

Usually you don't want to generate SQL, as debugging it is difficult, but you can write this piece by piece and make it highly testable... and as you are not accepting untrusted input, the usual SQL injection concerns can be mitigated.

CSV importers always start out simple, but in reality I find they need five steps:

  • 1) you need to read various data source types
  • 2) you need to read various file formats
  • 3) you need to convert these row by row into some internal representation and validate the data
  • 4) you need to convert the internal representation into its output data format
  • 5) you need to write it to a target.

and you want this driven by a database.

Then you can deal with multiple similar feeds that will be stored in a similar location and also single feeds that need to be sent to multiple similar locations.

John Nicholas
A: 

I think that maybe DB stored procs are not the simplest solution for one-time jobs. Reading/parsing CSVs and adding that data into a table... this kind of thing would be much simpler in a higher-level language (Java, Python, etc.). It would not be very complex to write it and, more importantly, to maintain it. Of course, if your PL/SQL (or whatever the equivalent is) is strong, go for the one single complex procedure.

I cannot see any reason to have 150 different stored procs doing approximately the same thing. What happens if 20 more tables get added? It will become a stored proc nightmare!!

Java Drinker