I have a task at hand of creating some kind of logic for a database. There is an IBM MQ Series setup in place today that replicates data from a couple of remote systems. At the moment something changes in the remote system, it dumps the data into a staging table in a SQL Server 2005 database.

I want to validate the data according to some simple validation logic (required fields and data format) and some more advanced checks of the data hierarchy. The data being imported is hierarchical, with a parent-child relationship between some of the records. The hierarchical validation should check both the data currently available in the staging table and the production table, to make sure that there will be a valid hierarchy after the import completes.

If validation of a record fails, the failure should be logged somehow and the failing record should not be imported, but the remaining records should be. Records in the current production tables should be replaced by new records that have the same id. The transfer of the data from the staging table should start as soon as possible after the data appears in the staging table.
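To illustrate the kind of check and selective import I mean, here is a rough T-SQL sketch; the table and column names (StagingItem, ProductionItem, ImportError, ParentId and so on) are made up for the example and are not my actual schema:

    -- Assumed tables: StagingItem(Id, ParentId, Name), ProductionItem(Id, ParentId, Name),
    -- ImportError(StagingId, ErrorMessage, LoggedOn).

    -- 1) Log rows whose parent exists neither in staging nor in production.
    INSERT INTO ImportError (StagingId, ErrorMessage, LoggedOn)
    SELECT s.Id, 'Parent ' + CAST(s.ParentId AS varchar(20)) + ' not found', GETDATE()
    FROM   StagingItem s
    WHERE  s.ParentId IS NOT NULL
       AND NOT EXISTS (SELECT 1 FROM StagingItem p WHERE p.Id = s.ParentId)
       AND NOT EXISTS (SELECT 1 FROM ProductionItem p WHERE p.Id = s.ParentId);

    -- 2) Import everything that did not fail: replace rows with the same Id, then add new ones.
    --    (FK ordering of parents vs. children is glossed over in this sketch.)
    BEGIN TRANSACTION;

    DELETE p
    FROM   ProductionItem p
           JOIN StagingItem s ON s.Id = p.Id
    WHERE  NOT EXISTS (SELECT 1 FROM ImportError e WHERE e.StagingId = s.Id);

    INSERT INTO ProductionItem (Id, ParentId, Name)
    SELECT s.Id, s.ParentId, s.Name
    FROM   StagingItem s
    WHERE  NOT EXISTS (SELECT 1 FROM ImportError e WHERE e.StagingId = s.Id);

    COMMIT TRANSACTION;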

As far as I know now, the total number of records in the production tables will not exceed 1 million, and the probable number of updated items per batch would be at most a couple of thousand rows.

My question is: which solution is the best fit for importing the data?

I have thought of a couple of possible solutions:

Means of starting the transfer:

  • Windows service that polls the staging table at a regular interval and kicks off some kind of transfer process whenever new data is inserted.
  • Make the MQ kick off the transfer process upon inserting new data into the table
  • Scheduling an SSIS job to run at a regular interval

Means of validating and transferring the data:

  • Creating an SSIS job
  • Creating Stored procedures
  • Custom .NET code

My main concerns are that the hierarchy must remain intact in the production tables at all times and the data should be available in the production table shortly after appearing in the staging table. I cannot be sure that a full hierarchy is available at all times in the staging table.

A: 

We use a closed process for this type of work.

Get data from the Remote system into Staging tables, then Import into Production tables.

If anything can insert into the Staging Tables at will, and there are Child Tables, then there is a risk that you might Import into the Production DB before all the Child Records are created in the staging tables.

If you are free to add columns to the Staging Tables (which the remote end will ignore), or if the staging tables have an IDENTITY or GUID that is unique / non-repeating, then you can create a parallel table.

Ideally the routine creating rows in the Staging Table will use a Batch number, and then create a "Batch Number Done" record when successful. So you have a semaphore to stop you importing until all associated records are in.
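As a minimal sketch of that semaphore, assuming a BatchNo column on the staging table and a BatchDone table written by the same routine (both names are placeholders), the import would only pick up completed batches:

    -- Only pick up rows from batches the writer has marked complete
    SELECT s.*
    FROM   StagingItem s
           JOIN BatchDone d ON d.BatchNo = s.BatchNo;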

(They could be inserted in a Transaction Block, but you would have to be confident that all processes inserting into the Staging Table honoured that.)

Given the IDENTITY / GUID I would create a 1:1 "Error table" to store any messages describing import failure.
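For example, a minimal error table keyed 1:1 on the staging table's IDENTITY column might look like this (names are placeholders):

    CREATE TABLE StagingItemError
    (
        StagingRowId int          NOT NULL PRIMARY KEY,        -- the staging row's IDENTITY value
        ErrorMessage varchar(500) NOT NULL,                    -- why the import of that row failed
        LoggedOn     datetime     NOT NULL DEFAULT GETDATE()
    );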

You may choose to move, or copy, failed rows to a separate failures staging table, so that the main staging table doesn't get clogged up and it is easier to make sure that the failures are being sorted out (by a human I presume).

Having said that, here is a more detailed description of our specific processes:

To minimise bandwidth and updates (i.e. reduce blocking and minimise unnecessary transaction log entries etc.) we do the following:

On the Source machine we hold a copy of the table being transferred. This has additional columns for Updated-on and Action (an Update or Delete flag - Update includes Insert, and an Insert may have been Updated again before the Destination ever gets that row ...)
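A sketch of such a table, assuming a source table with just Id and Name columns (all names here are placeholders):

    CREATE TABLE StagingCopy
    (
        Id        int          NOT NULL PRIMARY KEY,   -- same key as the source table
        Name      varchar(100) NULL,                   -- ... plus the other source columns ...
        UpdatedOn datetime     NOT NULL,                -- stamped whenever the row is inserted or changed
        Action    char(1)      NOT NULL                 -- 'U' = insert/update, 'D' = deleted at source
    );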

Inserts of new rows into this table are straightforward.

Updates to this table only happen if there is a difference in at least one column - so it's OK (if rather CPU intensive) to compare the whole source table against the staging table to find out what has changed. Updates set the Updated-on column. (Beware of when the clocks go back.)
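A rough sketch of that comparison, using the SourceTable/StagingCopy placeholder names above (SQL Server 2005 has no MERGE, so a plain UPDATE ... FROM plus an INSERT of the new rows does the job):

    -- Update rows where at least one column differs, and stamp Updated-on
    UPDATE sc
    SET    sc.Name      = src.Name,
           sc.UpdatedOn = GETDATE(),
           sc.Action    = 'U'
    FROM   StagingCopy sc
           JOIN SourceTable src ON src.Id = sc.Id
    WHERE  sc.Name <> src.Name
        OR (sc.Name IS NULL AND src.Name IS NOT NULL)
        OR (sc.Name IS NOT NULL AND src.Name IS NULL);

    -- Insert rows that are new at the source
    INSERT INTO StagingCopy (Id, Name, UpdatedOn, Action)
    SELECT src.Id, src.Name, GETDATE(), 'U'
    FROM   SourceTable src
    WHERE  NOT EXISTS (SELECT 1 FROM StagingCopy sc WHERE sc.Id = src.Id);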

Periodically we flag the staging table rows Action=Deleted if they can no longer be found in the Source table.
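For instance, with the same placeholder names:

    -- Mark rows that have disappeared from the source as deleted
    UPDATE sc
    SET    sc.Action    = 'D',
           sc.UpdatedOn = GETDATE()
    FROM   StagingCopy sc
    WHERE  sc.Action <> 'D'
       AND NOT EXISTS (SELECT 1 FROM SourceTable src WHERE src.Id = sc.Id);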

The data is copied from the Source to identical tables on the Destination, for rows where Updated-on is after the last transfer.

On the Destination server the routine that checks the data, and imports it to Production, works solely on the Updated-on date (process everything since last Updated-on)
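A minimal sketch of that watermark-driven step on the Destination, assuming a small LastTransfer bookkeeping table (again a placeholder, not part of the original setup):

    DECLARE @since datetime;
    SELECT @since = LastUpdatedOn FROM LastTransfer WHERE TableName = 'StagingCopy';

    -- Process everything changed since the last run
    SELECT Id, Name, UpdatedOn, Action
    FROM   StagingCopy
    WHERE  UpdatedOn > @since
    ORDER  BY UpdatedOn;

    -- After a successful import, advance the watermark
    UPDATE LastTransfer
    SET    LastUpdatedOn = ISNULL((SELECT MAX(UpdatedOn) FROM StagingCopy), @since)
    WHERE  TableName = 'StagingCopy';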

All staging tables are in a separate database which has minimal logging - the database will automatically refresh from the "last updated date" if it is restored, so our aim is to minimise the transaction log. We don't store the staging tables in the Production database (which has full transaction logging) in order to avoid bloating the TLogs.

Again, it is helpful if these processes are sequential so that they cannot happen concurrently; otherwise some sort of Batch No is needed to prevent transferring partially completed batches where Parent/Child tables exist.

We transfer the data in the style of the Source database, and only make data manipulations at the Destination end. The reason for this is that if the data manipulation is wrong, and we fix it, we only have to rerun it at the Destination end - the Source data is all present in the staging tables at the Destination end. If we did the manipulation at the Source end we would have to retransmit all the data again, as well.

For debugging, having the Updated-on date in the Staging table is helpful. We often get asked "Why is this different?", and seeing the Updated-on tells us whether the manipulation on the Destination end is at fault (i.e. we have recent data showing the expected Source data) or the Source end (no recent data found!)

Kristen