I want to gather data from different data servers located in Europe and Asia. Rather than running a plain data query task that would clog up the undersea network, I am thinking of using a couple of machines that will be available to me at the local sites.

I am thinking of designing the master package so that I can:

  1. run remote setup tasks
  2. launch the data collection package locally using psexec dtexec ... (see the sketch after this list)
  3. store the data locally in multiple raw files (one for each type of data)
  4. zip the files and pull them back
  5. unzip and bulk-load them into the local server
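
A rough sketch of what steps 2, 4 and 5 might look like when driven from a small orchestration script. The host name, share, package path and variable name below are placeholders, and the same flow could just as well live in a .cmd script or in SSIS tasks:

    import shutil
    import subprocess
    import zipfile
    from pathlib import Path

    # Placeholder names only: host, share, package path and variable name
    # are illustrative, not taken from a real environment.
    LOCATION = "asia-01"
    REMOTE_HOST = r"\\collector-asia-01"
    REMOTE_PACKAGE = r"D:\ssis\CollectData.dtsx"
    REMOTE_ZIP = rf"{REMOTE_HOST}\staging\{LOCATION}.zip"
    LOCAL_ZIP = Path(r"C:\incoming") / (LOCATION + ".zip")
    LOCAL_RAW = Path(r"C:\incoming") / LOCATION

    # Step 2: start the collection package on the remote machine;
    # check=True turns a non-zero psexec/dtexec exit code into an exception
    # so the calling package or script sees the failure.
    subprocess.run(
        ["psexec", REMOTE_HOST, "dtexec", "/F", REMOTE_PACKAGE,
         "/SET", rf"\Package.Variables[User::Location].Properties[Value];{LOCATION}"],
        check=True,
    )

    # Step 4: pull the zipped raw files back over the WAN via the share.
    shutil.copy(REMOTE_ZIP, LOCAL_ZIP)

    # Step 5: unzip; the bulk load itself would then be a BULK INSERT or an
    # SSIS data flow reading the extracted raw files.
    LOCAL_RAW.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(LOCAL_ZIP) as zf:
        zf.extractall(LOCAL_RAW)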

Data collection is handled through a custom script source, since the data is only available through a weird class library.

Tasks can fail unpredictably. If one type of data is captured successfully while the others fail for a particular location, I don't want to run that capture again.

How can I simplify this design if possible and make it more robust?

A: 

I would probably avoid creating a master package that does everything for all locations itself. Instead, create a configurable package that does these steps for one location (with SSIS variables specifying the location-specific properties).

You can then run this package either from a .cmd script or, if you want, create a master SSIS package with multiple Execute Process tasks, each starting the first package with the appropriate variable values.

P.S. Yes, in the master package you should use an Execute Process task that starts DTEXEC, not an Execute Package task. Unfortunately, Execute Package is not very configurable; see http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=295885.
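
As an illustration of the looping idea, here is a minimal sketch that runs the single configurable package once per location. The package path, variable name and location list are made up, and the same effect can come from a .cmd script or from Execute Process tasks in a master package:

    import subprocess

    # Hypothetical values: package path, variable name and locations are
    # placeholders for whatever the real configurable package expects.
    PACKAGE = r"C:\ssis\CollectOneLocation.dtsx"
    LOCATIONS = ["paris", "frankfurt", "singapore", "tokyo"]

    for location in LOCATIONS:
        # /SET overrides the SSIS variable that drives the location-specific
        # properties (connection strings, file paths, and so on).
        subprocess.run(
            ["dtexec", "/F", PACKAGE,
             "/SET", rf"\Package.Variables[User::Location].Properties[Value];{location}"],
            check=True,
        )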

Michael
+1  A: 

Extraction over a slow or expensive WAN link

I think what you describe sounds appropriate. For a slow or expensive WAN link you would want to reduce the amount of data transfer. Some approaches to this are:

  • Changed data capture.
  • Compression.

If you can easily identify new transactions or changed data at the source, you can reduce the data volumes by only sending the changes. If you have resources at the source but cannot easily identify changed data, you can do something like the following (build a generic framework for it if you need to; a small sketch follows the list):

  • Extract from the source
  • Calculate a hash value for each record using an algorithm with a low probability of collisions (e.g. MD5, SHA-1)
  • Maintain a database or file with the hash values in the form (source system key, hash value of all non-key attributes)
  • Bundle up anything with an unmatched hash value and send it over the WAN link.
  • Update the database of hashes.
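
A minimal sketch of the hash-compare step, assuming the extract lands as a CSV with the source system key in the first column; the file names and layout are illustrative only:

    import csv
    import hashlib
    import json
    from pathlib import Path

    # Placeholder file names; the hash store could just as well be a table
    # in a local database at the source site.
    HASH_STORE = Path("hashes.json")        # key -> hash of non-key attributes
    known = json.loads(HASH_STORE.read_text()) if HASH_STORE.exists() else {}

    changed_rows = []
    with open("extract.csv", newline="") as f:
        for row in csv.reader(f):
            key, attributes = row[0], row[1:]
            digest = hashlib.sha1("|".join(attributes).encode("utf-8")).hexdigest()
            if known.get(key) != digest:    # new or changed record
                changed_rows.append(row)
                known[key] = digest

    # Only the changed rows need to be zipped and sent over the WAN link;
    # persisting the hashes completes the "update the database of hashes" step.
    with open("changed.csv", "w", newline="") as out:
        csv.writer(out).writerows(changed_rows)
    HASH_STORE.write_text(json.dumps(known))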

Robust Distributed Extraction

There are many failure modes for a distributed system like this, so you will need to implement a reasonably robust error-handling mechanism. Examples of failure modes might be:

  • One of the source systems or network connections goes down, possibly on a scheduled basis.
  • One of the data packages arrives late.
  • Data is corrupted somehow.
  • Transient loads or other issues cause timeouts so a transfer must be chunked.

Depending on the requirements for the warehouse system, you may need to tolerate the failure of individual feeds. In particular, track what has already been captured successfully so that only the failed pieces are re-run.
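
One simple way to get that behaviour is a per-location checkpoint of which data types have already been captured. The file name, data type list and the extract() stub below are made-up placeholders for whatever the real capture step is:

    import json
    from pathlib import Path

    # Hypothetical names throughout; the checkpoint could equally live in a
    # control table instead of a file.
    CHECKPOINT = Path("checkpoint_asia-01.json")
    DATA_TYPES = ["orders", "inventory", "customers"]

    def extract(data_type: str) -> None:
        """Placeholder for the real capture step (script source, dtexec, ...)."""
        raise NotImplementedError

    done = set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

    for data_type in DATA_TYPES:
        if data_type in done:
            continue                    # captured successfully on an earlier run
        try:
            extract(data_type)
        except Exception as exc:        # leave the failed type for the next run
            print(f"{data_type} failed: {exc}")
            continue
        done.add(data_type)
        CHECKPOINT.write_text(json.dumps(sorted(done)))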

Merge-on-Extract vs. Merge-on-Transform

If the systems are identical (e.g. POS systems in a chain of retail shops), you will probably get a simpler architecture by merging the data before the transformation phase. This means that the staging area will need to be aware of the data source, at least for audit purposes.

If you have a small number of systems or several heterogeneous sources, data merging should take place during the transformation process. In this situation your ETL will probably have separate routines for each of the source systems, at least for some of the process.

Do we need an ODS?

One of the great religious wars in data warehousing is whether one should have an ODS. I've built systems with and without ODS structures, and in each case there were reasons for the design decision. My take is that there is no universally compelling argument on either side of this decision, which is the usual reason for the existence of religious wars in the first place.

My take on this, from a 50,000 ft view, is that the more source systems there are and the more homogeneous the data is, the stronger the case for an ODS. One can draw a Gartner-style quadrant for this:

High+--------------------------+--------------------------+
    |                          |                          |
    | Kimball Model (low/high) | Enterprise Data Warehouse|
H   | Unified ODS model hard   |        (high/high)       |
e   | to meaningfully design.  | ODS both difficult and   |
t   | Flat star schemas easier | probably necessary to    |
e   | to fit disparate         | make a manageable system.|
r   | data into.               | Better to separate trans-|
o   |                          | formation and history.   |
g   +--------------------------+--------------------------+
e   |                          |                          |
n   |                          |  Consolidated Reporting  |
e   |   Data Mart (low/low)    |       (high/low)         |
i   |                          |  ODS easy to implement   |
t   |   ODS probably of        |  and will simplify the   |
y   |   little benefit         |  overall system          |
    |                          |  architecture            |
    |                          |                          |
Low +--------------------------+--------------------------+
Low                  Number of data sources            High
ConcernedOfTunbridgeWells