Hello,

Here's our mission:

  • Receive files from clients. Each file contains anywhere from 1 to 1,000,000 records.
  • Records are loaded to a staging area and business-rule validation is applied.
  • Valid records are then pumped into an OLTP database in a batch fashion, with the following rules:
    • If record does not exist (we have a key, so this isn't an issue), create it.
    • If record exists, optionally update each database field. The decision is made based on one of 3 factors...I don't believe it's important what those factors are.

Our main problem is finding an efficient method of optionally updating the data at the field level. This applies across ~12 different database tables, with anywhere from 10 to 150 fields in each table (the original DB design leaves much to be desired, but it is what it is).

Our first attempt has been to introduce a table that mirrors the staging environment (staging has one field for each system field) and contains a masking flag per field; the value of the masking flag represents the 3 factors.
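
The mask table is shaped roughly like this (the key and data types below are placeholders; the real table has one flag column for every field in the corresponding OLTP table):

CREATE TABLE Mask
(
    RecordKey INT     NOT NULL PRIMARY KEY,   -- placeholder key tying the flags to a staged record
    Field1    TINYINT NOT NULL,               -- 0, 1, or 2: which of the 3 factors applies to Field1
    Field2    TINYINT NOT NULL
    -- ... one flag column per field, up to 150 on the widest table
);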

We've then put an UPDATE similar to...

UPDATE OLTPTable1
SET Field1 = CASE WHEN Mask.Field1 = 0 THEN Staging.Field1
                  WHEN Mask.Field1 = 1 THEN COALESCE( Staging.Field1 , OLTPTable1.Field1 )
                  WHEN Mask.Field1 = 2 THEN COALESCE( OLTPTable1.Field1 , Staging.Field1 )
                  ...

As you can imagine, the performance is rather horrendous.

Has anyone tackled a similar requirement?

We're a Microsoft shop using a Windows Service to launch SSIS packages that handle the data processing. Unfortunately, we're pretty much novices at this stuff.

A: 

If you are using SQL Server 2008, look into the MERGE statement; it may be suitable for your upsert needs here.
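
For illustration, the field-level rules from the question could ride inside the MERGE along these lines (a sketch only: the RecordKey join column is an assumption, and whether Mask joins per record or is a single row of flags depends on your design):

MERGE OLTPTable1 AS tgt
USING ( SELECT s.RecordKey, s.Field1, m.Field1 AS MaskField1
        FROM   Staging s
        JOIN   Mask    m ON m.RecordKey = s.RecordKey ) AS src
ON ( tgt.RecordKey = src.RecordKey )
WHEN MATCHED THEN
    UPDATE SET Field1 = CASE src.MaskField1
                            WHEN 0 THEN src.Field1
                            WHEN 1 THEN COALESCE( src.Field1 , tgt.Field1 )
                            WHEN 2 THEN COALESCE( tgt.Field1 , src.Field1 )
                        END
               -- , Field2 = ... repeated for each column
WHEN NOT MATCHED THEN
    INSERT ( RecordKey, Field1 )          -- plus the remaining columns
    VALUES ( src.RecordKey, src.Field1 );

The per-field CASE logic doesn't go away, but the existence check and the insert/update collapse into one set-based statement.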

Can you use a Conditional Split on the input to send the rows to a different processing stage depending on which factor is matched? It sounds like you may need to do this for each of the 12 tables, but you could potentially run some of them in parallel.

revelator
A: 

I took a look at the merge tool, but I'm not sure it would allow the flexibility to indicate which data source takes precedence based on a predefined set of rules.

This flexibility is critical: the system has to let multiple members, who can have very different needs, use the same process.

From what I have read, the Merge function is more of a sorted union.

Paul
A: 

We do use an approach similar to what you describe in our product for external system inputs (we handle a couple of hundred target tables with up to 240 columns). As you describe, there's anywhere from 1 to a million or more rows.

Generally, we don't try to set up a single mass update; we handle one column's values at a time. Given that the values are all of a single type representing the same data element, the staging UPDATE statements are simple. We generally create scratch tables for mapping values, and then it's a simple

UPDATE target
SET    target.[column] = mapping.resultcolumn
FROM   target
JOIN   mapping ON mapping.sourcecolumn = target.sourcecolumn;

Setting up the mappings is a little involved, but we again deal with one column at a time while doing that.
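
As a rough illustration of the shape (the names and the dummy cleanup rule here are invented for the example, not our real schema):

-- one scratch mapping table per target column
CREATE TABLE #mapping
(
    sourcecolumn VARCHAR(100) NOT NULL PRIMARY KEY,
    resultcolumn VARCHAR(100) NOT NULL
);

-- fill it with the distinct incoming values and what each one should become;
-- the business rule lives here instead of in the UPDATE itself
INSERT INTO #mapping ( sourcecolumn, resultcolumn )
SELECT DISTINCT sourcecolumn,
       UPPER( LTRIM( RTRIM( sourcecolumn ) ) )   -- dummy rule standing in for the real one
FROM   target
WHERE  sourcecolumn IS NOT NULL;

With the scratch table in place, the single-column UPDATE above applies the whole mapping in one set-based pass.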

I don't know how you define 'horrendous'. For us, this process is done in batch mode, generally overnight, so absolute performance is almost never an issue.

EDIT: We also do these in configurable-size batches, so the working sets and COMMITs are never huge. Our default is 1,000 rows in a batch, but some specific situations have benefited from batches of up to 40,000 rows. We also add indexes to the working data for specific tables.
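
A bare-bones sketch of that batching loop, reusing the made-up names from above (real code also tracks progress and handles errors):

DECLARE @BatchSize INT;
SET @BatchSize = 1000;   -- the configurable piece

WHILE 1 = 1
BEGIN
    -- each pass touches at most @BatchSize rows, so each commit stays small
    UPDATE TOP ( @BatchSize ) t
    SET    t.[column] = m.resultcolumn
    FROM   target t
    JOIN   #mapping m ON m.sourcecolumn = t.sourcecolumn
    WHERE  t.[column] <> m.resultcolumn OR t.[column] IS NULL;   -- skip rows already correct

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to change
END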

DaveE