views: 172
answers: 4

Our DBA (with only 2 years of Google for training) has created a massive data management table (108 columns and growing) containing all necessary attributes for any data flow in the system. We'll call this table the BFT for short.

Of these columns:
10 are for meta-data references.
15 are for data source and temporal tracking.
1 instance of new/current columns for textual data.
10 instances of new/current/delta/ratio/range columns for multi-value numeric updates, totaling 50 columns.

Multi-valued numeric updates usually need only 2-5 of the update groups.
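
Roughly, the structure is something like this (column names are mine, invented just to illustrate; the real names differ):

    CREATE TABLE dbo.BFT (
        bft_id         BIGINT IDENTITY(1,1) PRIMARY KEY,
        -- ~10 meta-data reference columns
        meta_ref_1     INT NULL,
        meta_ref_2     INT NULL,
        -- ... 8 more meta-data reference columns ...
        -- ~15 data source / temporal tracking columns
        source_system  VARCHAR(50) NULL,
        load_batch_id  INT NULL,
        effective_date DATETIME NULL,
        -- ... 12 more tracking columns ...
        -- 1 new/current pair for textual data
        text_new       VARCHAR(4000) NULL,
        text_curr      VARCHAR(4000) NULL,
        -- 10 repeating new/current/delta/ratio/range groups = 50 columns
        num1_new       DECIMAL(18,6) NULL,
        num1_curr      DECIMAL(18,6) NULL,
        num1_delta     DECIMAL(18,6) NULL,
        num1_ratio     DECIMAL(18,6) NULL,
        num1_range     DECIMAL(18,6) NULL
        -- ... num2_* through num10_*, mostly NULL in any given load ...
    );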

Batches of 15K-1500K records are loaded into the BFT and processed by stored procs with logic to validate those records and shuffle them off to permanent storage in about 30 other tables.

In most record loads, 50-70 of the columns are empty throughout the entire process.

I am no database expert, but this model and process seem to smell a little. I don't know enough to say why, though, and I don't want to complain without being able to offer an alternative.

Given this very small insight into the data processing model, does anyone have thoughts or suggestions? Can the database (SQL Server) be trusted to handle records with mostly empty columns efficiently, or does processing in this manner waste lots of cycles, memory, etc.?

+1  A: 

Normalization is the keyword here. If you have so many NULL values, chances are high that you're wasting a lot of space. Normalizing the table should also make data integrity in this table easier to enforce.
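
For instance, one way to normalize the repeating numeric update groups might be sketched roughly like this (all table and column names are invented for illustration):

    -- Core record: meta-data references, source/temporal tracking,
    -- and the textual new/current pair.
    CREATE TABLE dbo.BFT_Record (
        record_id      BIGINT IDENTITY(1,1) PRIMARY KEY,
        source_system  VARCHAR(50) NOT NULL,
        load_batch_id  INT NOT NULL,
        text_new       VARCHAR(4000) NULL,
        text_curr      VARCHAR(4000) NULL
    );

    -- One row per numeric update group actually present, instead of
    -- 10 fixed groups (50 columns) that are mostly NULL.
    CREATE TABLE dbo.BFT_NumericUpdate (
        record_id    BIGINT NOT NULL REFERENCES dbo.BFT_Record (record_id),
        group_no     TINYINT NOT NULL,   -- 1..10
        new_value    DECIMAL(18,6) NULL,
        curr_value   DECIMAL(18,6) NULL,
        delta_value  DECIMAL(18,6) NULL,
        ratio_value  DECIMAL(18,6) NULL,
        range_value  DECIMAL(18,6) NULL,
        CONSTRAINT PK_BFT_NumericUpdate PRIMARY KEY (record_id, group_no)
    );

A record that uses only 2-5 of the 10 groups then stores only 2-5 rows in the child table rather than 50 mostly-NULL columns.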

Lucero
+1  A: 

I typically have multiple staging tables corresponding to the input loads. These may or may not correspond to the destination tables, but we don't do what you're describing. If he doesn't like having a lot of what are basically temporary work tables, they could be put into their own schema or even a separate database.

As far as the empty columns go, if they aren't referenced in the particular query processing the BFT, they don't matter - HOWEVER, it becomes much more crucial that the index chosen is a non-clustered covering index. When your BFT is used and a table scan or clustered index scan is chosen, the unused columns have to be read and ignored or skipped, and in my experience this definitely affects processing. With a non-clustered index scan or seek, fewer columns are read, and hopefully these don't include (m)any of the unused columns.
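
As a rough sketch (index and column names are made up), I mean something like:

    -- If a processing query filters on the load batch and reads only a handful
    -- of columns, a covering non-clustered index can satisfy it without
    -- touching the other hundred-odd columns in the base table.
    CREATE NONCLUSTERED INDEX IX_BFT_LoadBatch_Covering
        ON dbo.BFT (load_batch_id)
        INCLUDE (source_system, num1_new, num1_curr, num1_delta);

Queries that touch only the key and INCLUDEd columns can then be answered from the narrow index alone.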

Cade Roux
There are staging tables that capture the raw input data from around 12 different formats. When these are resolved to meta-data types, the raw staged data produces one or more BFT records, which are then evaluated for correctness prior to making an actual application data (table, row, column) change.
gbegley
That seems like overkill: you're bringing them into a unified table for processing, but the table doesn't really unify everything, and each process coming out of it probably has to handle the data differently anyway.
Cade Roux
Agreed, but having everything in one table allows easy reporting of the system I/O. This used to be in about 5 different tables that were brought together with 'union all' to satisfy real-time reporting needs. You are correct that the procs have a lot of "case when data_type='x' then ..." logic.
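The old reporting layer was roughly this shape (table and column names simplified for illustration):

    CREATE VIEW dbo.vw_SystemIO
    AS
        SELECT load_batch_id, source_system, load_date, 'type_a' AS data_type
        FROM dbo.Load_TypeA
        UNION ALL
        SELECT load_batch_id, source_system, load_date, 'type_b' AS data_type
        FROM dbo.Load_TypeB;
        -- ... and so on for each of the roughly 5 original tables ...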
gbegley
+3  A: 

Sounds like he reinvented BizTalk.

Chris McCall
+1  A: 

One thing that might make things a little more flexible (other than normalizing) could be to create one or more views or table functions to present the data. Particularly if the table is outside your control, these would enable you to filter the spurious crap out and grab only what you need from the table.
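
For example, something along these lines (view and column names are invented) would let each process see only the slice it cares about:

    -- Hypothetical: expose just the columns one particular process needs,
    -- hiding the other ~100 and filtering out rows it never touches.
    CREATE VIEW dbo.vw_BFT_ForMyProcess
    AS
        SELECT bft_id,
               load_batch_id,
               source_system,
               num1_new,
               num1_curr,
               num1_delta
        FROM dbo.BFT
        WHERE num1_new IS NOT NULL;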

However, if you're going to be one of the people who will be working with (and frowning every time you have to crack open) that massive table, you might want to trump the DBA's "design" and normalize that beast, and maybe give the DBA the task of creating some views and/or table functions to help you out.

I currently work with a similar but not so huge table which has been around on our system for years and has had new fields and indices and constraints rather hastily tacked on Frankenstein-style. Unfortunately some other workgroups rely on the structure as gospel, so we've created such views and functions to enable us to "shape" the data the way we need it.

Darth Continent
Given the table's history of five tables being brought into one, I find "Frankenstein-style" an apt characterization of this beast. If literary metaphor could win the argument, "Frankenstein-style" would take the check. I hope you're at least a little sated with a vote up.
gbegley
Hehe thanks, I'll take what I can get! ;-)
Darth Continent