I have an application that imports large volumes of data daily, on the order of several hundred thousand records.
The data comes from different sources. It is read using C#, then bulk inserted into the database.
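
The bulk-insert step itself is nothing exotic; simplified, it looks something like this (SqlBulkCopy into a staging table whose name is a placeholder here):

```csharp
using System.Data;
using System.Data.SqlClient;

public static class ImportLoader
{
    // Simplified sketch of the daily load; dbo.ImportStaging is a placeholder name.
    public static void BulkLoad(DataTable parsedRows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.ImportStaging";
                bulkCopy.BatchSize = 10000;         // commit in chunks, not one huge batch
                bulkCopy.BulkCopyTimeout = 0;       // no timeout for large loads
                bulkCopy.WriteToServer(parsedRows); // rows built by the C# readers
            }
        }
    }
}
```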

This data is then processed:

  • different tables are linked
  • new tables are generated
  • data is corrected using complicated algorithms (the totals of certain tables have to sum to zero)

Most of this processing is done in stored procedures.
Although some of the complex processing would be simpler in C#, extracting the data into a dataset and re-injecting it would slow things down considerably.
You may ask why I do not process the data before inserting it into the database, but I do not think it is practical to manipulate hundreds of thousands of records in memory, and SQL's set-based commands help when creating lots of records. A rough sketch of the kind of set-based correction I mean is below.
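
To make the set-based point concrete, a totals-to-zero correction can be done in a single statement, something like this (the table and column names are invented for the example):

```csharp
using System.Data.SqlClient;

public static class Corrections
{
    // Invented schema (dbo.Entries, AccountId, Amount): insert one balancing
    // row per account so that each account's entries sum to zero.
    public static void InsertBalancingRows(string connectionString)
    {
        const string sql = @"
            INSERT INTO dbo.Entries (AccountId, Amount, IsCorrection)
            SELECT AccountId, -SUM(Amount), 1
            FROM dbo.Entries
            GROUP BY AccountId
            HAVING SUM(Amount) <> 0;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.CommandTimeout = 0;   // long-running set-based statement
            command.ExecuteNonQuery();    // one statement corrects every account at once
        }
    }
}
```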

This will probably spark the age-old debate about stored procedures and their pros and cons (e.g. how do you unit test stored procedures?).

What I would like in response is your experience with large volumes of data and how you tackled the problem.

+1  A: 

I would use SSIS or DTS (assuming you are talking about MS SQL Server). They are made for exactly this purpose and work with stored procedures if you need them.

Another option is to preprocess the data using Perl. Even though it sounds like a weird suggestion, Perl is actually extremely fast in these scenarios. I've used it in the past to process billions of records in reasonable time (i.e. days instead of weeks).

Regarding "How do you unit test stored procedures": you unit test them with MbUnit like anything else. The only bit of advice is that the setup and rollback of the data can be tricky; you can either use a DTS transaction or explicit SQL statements.
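
A minimal sketch of that setup/rollback pattern, written here with NUnit-style attributes (MbUnit supports the same [TestFixture]/[Test] style); the procedure name, test data and connection string are placeholders:

```csharp
using System.Data;
using System.Data.SqlClient;
using NUnit.Framework;

[TestFixture]
public class StoredProcedureTests
{
    private SqlConnection _connection;
    private SqlTransaction _transaction;

    [SetUp]
    public void SetUp()
    {
        _connection = new SqlConnection("...test database connection string...");
        _connection.Open();
        _transaction = _connection.BeginTransaction();
        // insert known test rows here, enlisted in _transaction
    }

    [Test]
    public void CorrectTotals_MakesAccountsSumToZero()
    {
        using (var command = new SqlCommand("dbo.usp_CorrectTotals", _connection, _transaction))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.ExecuteNonQuery();
        }
        // query the affected rows (still inside _transaction) and assert on them
    }

    [TearDown]
    public void TearDown()
    {
        _transaction.Rollback();   // throw away everything the test inserted
        _connection.Dispose();
    }
}
```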

Sklivvz
+1  A: 

I would generally have to agree with Sklivvz when it comes to doing things in MS SQL. SSIS and DTS are the way to go, but if you are unfamiliar with those technologies they can be cumbersome to work with. However, there is an alternative that would allow you to do the processing in C# and still keep your data inside SQL Server.

If you really think the processing would be simpler in C#, then you may want to look into using a SQL Server project to create database objects in C#. There are a lot of really powerful things you can do with CLR objects inside SQL Server, and this would allow you to write and unit test the code before it ever touches the database. You can unit test your CLR code inside VS using any of the standard unit testing frameworks (NUnit, MSTest), and you don't have to write a bunch of set-up and tear-down scripts that can be difficult to manage.
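
Roughly, a CLR stored procedure in such a project looks like this (the procedure name and body are placeholders); the real logic can live in an ordinary C# class that you unit test directly and simply call from this wrapper:

```csharp
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public partial class StoredProcedures
{
    // Exposed to T-SQL as dbo.CorrectTotals once the assembly is deployed.
    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void CorrectTotals()
    {
        // "context connection" runs inside the caller's session and transaction.
        using (var connection = new SqlConnection("context connection=true"))
        {
            connection.Open();
            // call into ordinary, unit-tested C# logic here, or issue
            // set-based statements through this connection
        }
    }
}
```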

As far as testing your stored procedures goes, I would honestly look into DbFit for that. Your database doesn't have to be a black hole of untested functionality any more :)

Josh
A: 

Where you process data depends greatly on what you're doing. If you need, for example, to discard data which you don't want in your database, then you would process that in your C# code. However, the logic you put in the database should generally be "implementation agnostic". So if someone else wants to insert data from a Java client, the database should be able to reject bad data on its own. If you put that logic only in your C# code, the Java code won't know about it.

Some people object and say "but I'll never use another language for the database!" Even if that's true, you'll still have DBAs or developers working with the database directly, and they'll make mistakes if the logic isn't there. Or your new C# developer will try to shove data in and not know about (or simply ignore) data pre-processors written in C#.

In short, the logic you put in your database should be enough to guarantee that the data is correct without relying on external software.
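
As a trivial, made-up example of that kind of database-side guarantee, a CHECK constraint will reject bad rows no matter which client inserts them (the table name and rule are invented here):

```csharp
using System.Data.SqlClient;

public static class SchemaGuards
{
    // Once this constraint exists, the server enforces the rule for every
    // client: C#, Java, or an ad-hoc query in Management Studio.
    public static void AddAmountConstraint(string connectionString)
    {
        const string sql = @"
            ALTER TABLE dbo.Entries
            ADD CONSTRAINT CK_Entries_Amount_Range
            CHECK (Amount BETWEEN -1000000 AND 1000000);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```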

Ovid