views: 496 · answers: 5

We are building a new application in .NET 3.5 with a SQL Server database. The database is fairly large, with around 60 tables and loads of data. The .NET application has functionality to bring data into this database from data entry and from third-party systems.

Once all the data is available in the database, the system has to do lots of calculations. The calculation logic is pretty complex. All the data required for the calculations is in the database, and the output also needs to be stored there. The data gathering happens every week, and the calculations need to be run every week to generate the required reports.

Given the above scenario, I was thinking of doing all these calculations in stored procedures. The problem is that we also need database independence, and stored procedures will not give us that. But if I do all this in .NET by querying the database all the time, I don't think it will finish the work quickly.

For example, I need to query one table which will return 2,000 rows; then for each row I need to query another table which will return 300 results; then for each of those rows I need to query multiple tables (around 10) to get the required data, do the calculation, and store the output in another table.
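In rough pseudocode terms, the access pattern is the following (sketched here in Python with invented function and column names, not our real schema; each level of nesting is a separate database round-trip in the naive version):

```python
# Sketch of the nested per-row query pattern described above.
# All names are placeholders; the fetch_* functions stand in for queries.

def fetch_skus():                    # ~2,000 rows in practice
    return ["SKU-%04d" % i for i in range(3)]

def fetch_attributes(sku):           # ~300 rows per SKU
    return ["handling_cost", "transport_cost"]

def fetch_reference_data(sku, attribute):
    # ~10 lookups per (sku, attribute): currency, units, area, company, ...
    return {"currency_rate": 1.1, "unit_factor": 2.0}

results = []
for sku in fetch_skus():                              # query 1
    for attribute in fetch_attributes(sku):           # one query per SKU
        ref = fetch_reference_data(sku, attribute)    # ~10 queries per pair
        value = ref["currency_rate"] * ref["unit_factor"]
        results.append((sku, attribute, value))       # row for the output table

# At the real volumes this is roughly 2,000 * 300 * 10 = 6,000,000 queries.
```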

Now my question is: should I go ahead with the stored-procedure solution and forget about database independence, since performance is important? I also think development time will be much less with the stored-procedure solution. If a client wants this solution on, say, an Oracle database (because they don't want to maintain another database), then we port the stored procedures to Oracle and maintain two versions for any future changes/enhancements. Similarly, other clients may ask for other databases.


The 2,000 rows I mentioned above are product SKUs. The 300 rows are the different attributes we want to calculate, e.g. handling cost, transport cost, etc. The 10 tables have information about currency conversion, unit conversion, network, area, company, sell price, number sold per day, etc. The resulting table stores all the information as a star schema for analysis and reporting purposes. The goal is to capture every detail about a product, so that one knows which attribute of selling a product is costing us money and where we can improve.

+3  A: 

Well, without any specific details of what data you have in these tables, a back-of-the-napkin calculation shows that you're talking about processing over 6 million rows of information in the example you provided (2,000 rows × 300 rows × 10 lookups each).

Are all of these rows distinct, or are the 10 tables lookup information with relatively low cardinality? In other words, would it be possible to write a program that holds the information from the 10 lookup tables in memory, and then just process each 300-row result set in memory to perform the calculations?
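To make the idea concrete, here is a hedged sketch (in Python for brevity, though your app is .NET; all table and column names are invented): load each low-cardinality lookup table into a dictionary once, then resolve it per row with no further round-trips.

```python
# Hypothetical sketch: cache low-cardinality lookup tables in memory,
# then resolve them per row without any further database round-trips.
# Imagine each dict built once from "SELECT key, value FROM ...".

currency_rates = {"EUR": 1.1, "GBP": 1.3}    # currency conversion table
unit_factors   = {"kg": 1.0, "lb": 0.4536}   # unit conversion table

def calc_cost(row):
    # row is one of the ~300 attribute rows; no queries are issued here
    return row["amount"] * currency_rates[row["currency"]] * unit_factors[row["unit"]]

rows = [
    {"amount": 10.0, "currency": "EUR", "unit": "kg"},
    {"amount": 5.0,  "currency": "GBP", "unit": "lb"},
]
costs = [calc_cost(r) for r in rows]
```

This only pays off if the 10 tables really are small reference data; if they are large and the joined rows are all distinct, caching them whole buys nothing.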

Also, I would be concerned about scalability: if you do this in a stored procedure, it is guaranteed to be a serial process limited by the speed of the single database server. If you can run multiple copies of a client program, each processing a chunk of the 2,000-record initial set, then you can perform some of the calculations in parallel, perhaps speeding up your overall processing time as well as making it scalable for when your initial record set is 10 times larger.
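A minimal sketch of that chunked-parallel idea (again in Python; `process_sku` is a stand-in for the real per-SKU query-and-calculate work, and the chunk size of 500 is arbitrary):

```python
# Split the initial record set into chunks and process them concurrently.
# process_sku is a placeholder for: query dependent rows, calculate, write output.
from concurrent.futures import ThreadPoolExecutor

def process_sku(sku):
    return (sku, sku * 2)          # dummy calculation for illustration

def process_chunk(chunk):
    return [process_sku(s) for s in chunk]

def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

skus = list(range(2000))           # stand-in for the initial 2,000-row set
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [row for part in pool.map(process_chunk, chunks(skus, 500))
               for row in part]
# results holds one output row per SKU, computed four chunks at a time
```

Whether this helps in practice depends on where the time goes: if the bottleneck is the database server itself rather than the client, parallel clients just queue up behind it.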

scwagner
All the rows are distinct, so lookup tables in memory are of no help. I did think about parallel processing in chunks, but ultimately the number of database transactions is the same, so I don't think I will get any benefit.
Bhushan
+1  A: 

Calculation code tends to be easier to write and more maintainable in C#. Also, keeping processing on the SQL Server to a minimum is normally good practice, since the database is the hardest tier to scale.

Having said that, from your description it sounds like the stored-procedure approach is the way to go. When calculation code depends on large volumes of data, it's going to be expensive to move that data off the server for calculation. So unless you have reasonable ways of optimizing the dependent data (such as caching lookup tables?), you are most likely going to find it more painful than it's worth not to use a stored proc.

jwanagel
+3  A: 

I wouldn't consider doing the data manipulation anywhere other than in the database.

Most people try to work with database data using looping algorithms. If you need real speed, think of your data as a SET of rows: you can update thousands of rows with a single UPDATE. I have rewritten many cursor loops, written by novice programmers, into single UPDATE statements, and the execution time was massively improved.

You say:

I need to query one table which will return me 2000 rows then for each row I need to query another table which will return me 300 results then for each row of this I need to query multiple tables (around 10) to get required data

From your question it looks like you are not using joins, and you are already thinking in loops. Even if you do intend to loop, it is much better to write a query that joins in all the necessary data and then loop over that result. Remember that UPDATE and INSERT statements can be driven by massively complex queries. Throw in CASE expressions, derived tables, and conditional joins (LEFT OUTER JOIN), and you can solve just about any problem in a single UPDATE/INSERT.
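As an illustration of the set-based point (shown here with Python's built-in sqlite3 so it is self-contained; the tables and columns are made up, not the asker's schema): one UPDATE driven by a lookup replaces an entire cursor loop.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (sku TEXT PRIMARY KEY, currency TEXT,
                          price REAL, cost REAL);
    CREATE TABLE currency_rate (currency TEXT PRIMARY KEY, rate REAL);
    INSERT INTO product VALUES ('A', 'EUR', 10.0, NULL), ('B', 'GBP', 5.0, NULL);
    INSERT INTO currency_rate VALUES ('EUR', 1.1), ('GBP', 1.3);
""")

# Instead of looping row by row (SELECT, calculate, then one UPDATE per row),
# a single set-based statement converts every product's price at once:
con.execute("""
    UPDATE product
    SET cost = price * (SELECT rate FROM currency_rate c
                        WHERE c.currency = product.currency)
""")

rows = con.execute("SELECT sku, cost FROM product ORDER BY sku").fetchall()
# rows now pairs each SKU with its currency-converted cost
```

The same shape works in T-SQL or PL/SQL with joins instead of the correlated subquery; the point is that the database engine does the iteration, not your code.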

KM
I am not looking for implementation details at present; the information I gave is just to give an idea of the task I want to achieve. I will keep your suggestion in mind when we start implementing. Currently I want to know whether to go with stored procedures or do the processing in the .NET application.
Bhushan
+1  A: 

Stored procedures every time, but as KM said, within those stored procedures keep iteration to a minimum; that is to say, use joins in your SQL. Relational databases are very good at joining.

Database scalability will be a small issue, especially as it sounds like you'd be performing these calculations in a batch process.

Database independence doesn't really exist except for the most trivial of CRUD applications, so if your initial requirement is to get this all working with SQL Server, then leverage the tools the RDBMS provides (after all, your client will have spent a great deal of money on it). If (and it's a big if) a subsequent client really doesn't want to use SQL Server, then you'll have to bite the bullet and port the code to another flavour of stored procedure. But then, as you identified ("if I do all this in .NET by querying the database all the time, I don't think it will be able to finish the work quickly"), you've deferred that expense until if and when it's required.

DJO
A: 

I would consider doing this in SQL Server Integration Services (SSIS). I'd put the calculations into SSIS but leave the queries as stored procedures. This would give you database independence (SSIS can process data from any database with an ODBC connection) as well as high performance. Only the simple SELECT statements would be in stored procedures, and those are the parts of the SQL standard most likely to be identical across multiple database products (assuming you stick to standard forms of query).

John Saunders