views:

111

answers:

6

Hi, I'd like to start a discussion about the implementation of a database system.

I'm working for a company having a database system grown over ca. the last 10 years.

Let me try to describe what it's doing and how it's implemented:

The system is divided into 3 main parts handled by 3 different teams.

  1. Entry: The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.

  2. Dataflow: The purpose of the Dataflow Team is to do do data calculations and prepare the data for the reporting team. This is done via multiple levels. Let me try to explain the process a little bit more in detail: The Dataflow Team uses the data from the Entry database copied to another server and another database via Transactional-Replication (this data contains information from all clients). Then once per hour a self-written application is checking for changed rows in the input tables (using a ChangedDate column) and then calling a stored procedure for each output table calculating new data using 1-N of the input tables. After that the data is copied to another database on another server using again Transaction-Replication. Here another stored procedure is called to calclulate additional new output tables. This stored procedure is started using a SQL job. From there the data is split to different databases, each database being client specific. This copying is done using another self-written application using the .NET bulkcopy command (filtering on the client). These client specific databases are copied to different client specific reporting databases on other servers via another self-written application which compares the reporting database with the client specific database to calculate the data difference. Just the data differences are copied (because the reporting database run in former times on the client servers). This whole process is orchestrated by another self-written application to control e.g. if the Transactional-Replications are finished before starting the job to call the Stored procedure etc... Futhermore also the synchronisation between the different clients is orchestrated here. The process can be graphically displayed by a self-written monitoring tool which looks pretty complex as you can imagine... The status of all this components is logged and can be viewed by another self-written application. If new columns or tables are added all this components have to be manually changed. For deployment installation instructions are written using MS Word. (ca. 10 people working in this team)

  3. Reporting: The Reporting Team created it's own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the Web.

The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.

Well, what I want to discuss is how other people realize this scenario, I can't imagine that every company writes it's own custom applications. What are actually the possibilities to allow fast calculations on databases (next to using T-SQL). I'm somehow missing the link here to the object oriented programming I'm used to from my old company, but we never dealt with so much data and maybe for fast calculations this is the way to do it...Or is it possible using e.g. LINQ or BizTalk Server to create the algorithms and calculations, maybe even in a graphical way? The question is just how to convert the existing meter-long Stored procedures into the new format... In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.

Any comments are appreciated.

Thanks Daniel

A: 

From the sounds of what you are saying, you have a three step process.

  1. Input data
  2. Analyze data
  3. Report data

Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand, otherwise, they would be directly working on SQL Server, and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking, you are building a customized system to report on the data at hand. The only item that might be worth considering on these side, is standardization between the teams on common libraries and the technologies used.

Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it. WIthout knowing more of your organization and the exact nature of operations.

Mitchel Sellers
+1  A: 

Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like t-sql? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is and the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.

"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.

HLGEM
+1 Don't count on any transferable concepts from OO (or any other form of procedural) programming to SQL. You need to learn SQL entirely on its own terms. And trying to mix in abstraction complexity with LINQ, SSIS, etc. without knowing SQL (which after all they must all resolve to anyway) will just mess things up.
le dorfier
+1  A: 

I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).

The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)

In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQl SPs.

We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.

I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.

Cade Roux
+2  A: 

The best advice, having been here before, is to start by learning first how SQL works, and doing it in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it.)

Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.

Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.

It will take a while, but it's all worth knowing, both professionally and personally.

le dorfier
A: 

By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.

MatthewMartin
A: 

You'd be hard pressed to defeat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure I can see why you'd want to go to .NET. But you'd probably increase performance by figuring out how to rewrite them all nice and SET based. BCP is not going to be able to be beaten. When I used SQL Server 2000 BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance no doubt people are going to be coming to you. Still if you are doing a ton of row by row complex calculations, optimizing that into a CLR stored procedure or even a .NET application that is called from SQL Server to do the processing will probably result in a speed up. Of course if you were row processing and you manage to rewrite the queries to do set processing you'd probably get a bigger speed up. But depending upon how complex the calculations are .NET may help.

Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.

The only thing I might as is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and or the column names. If so, you can probably use .NET combined with SQL-SMO(or DMO if using SQL Server 2000) to code generate it.

Here's an example that I often see to load a datawarehouse

Assuming some row tables are loaded with the data from the source

select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row

I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.

In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....

Cervo