Extraction over a slow or expensive WAN link
I think what you describe sounds appropriate. For a slow or expensive WAN link you will want to reduce the amount of data transferred. Two approaches to this are:
- Change data capture.
- Compression.
If you can easily identify new transactions or changed data at the source, you can reduce data volumes by sending only the changes. If you have resources at the source but cannot easily identify changed data, you can do something like this (build a generic framework for it if you need to):
- Extract from the source
- Calculate a hash value over the non-key attributes using an algorithm with a low probability of collisions (e.g. MD5, SHA-1)
- Maintain a database or file with the hash values in the form (source system key, hash value of all non-key attributes)
- Bundle up anything with an unmatched hash value and send it over the WAN link.
- Update the database of hashes.
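The steps above could be sketched in Python like this. It is a minimal illustration under my own assumptions: rows are dicts, the hash store is an in-memory dict standing in for the database or file of (key, hash) pairs, and the names `row_hash` and `detect_changes` are hypothetical.

```python
import hashlib

def row_hash(row, key_fields):
    """Hash all non-key attributes of a row (a dict) in a stable field order."""
    payload = "|".join(str(row[f]) for f in sorted(row) if f not in key_fields)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def detect_changes(extracted_rows, stored_hashes, key_fields):
    """Return rows whose hash is new or changed, plus the updated hash store.

    stored_hashes maps the source-system key tuple to the last-seen hash,
    i.e. the (source system key, hash of non-key attributes) structure above.
    """
    changed = []
    updated = dict(stored_hashes)
    for row in extracted_rows:
        key = tuple(row[f] for f in key_fields)
        h = row_hash(row, key_fields)
        if updated.get(key) != h:   # unmatched hash: new or changed row
            changed.append(row)
            updated[key] = h
    return changed, updated
```

Only the rows returned by `detect_changes` need to be bundled up and sent over the WAN link; the updated store is then persisted for the next run.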
Robust Distributed Extraction
There are many failure modes for a distributed system like this, so you will need to implement a reasonably robust error handling mechanism. Examples of failure modes might be:
- One of the source systems or network connections goes down, possibly on a scheduled basis.
- One of the data packages arrives late.
- Data is corrupted somehow.
- Transient loads or other issues cause timeouts so a transfer must be chunked.
Depending on the requirements for the warehouse system, you may need to tolerate the failure of individual feeds; design your error handling strategy accordingly.
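The timeout-and-chunking case might be handled along these lines. This is a sketch under stated assumptions: `send_chunk` is a placeholder for whatever transport call you actually use, and the chunk size, retry count, and backoff values are illustrative.

```python
import time

class TransferTimeout(Exception):
    """Raised by the transport layer when a transfer times out."""

def send_package(package, send_chunk, chunk_size=1000, max_retries=3, backoff=5):
    """Send a data package in chunks, retrying timed-out chunks with backoff.

    send_chunk is the (hypothetical) transport call; it is expected to
    raise TransferTimeout when a chunk transfer times out.
    """
    chunks = [package[i:i + chunk_size]
              for i in range(0, len(package), chunk_size)]
    for chunk in chunks:
        for attempt in range(max_retries):
            try:
                send_chunk(chunk)
                break
            except TransferTimeout:
                if attempt == max_retries - 1:
                    raise  # give up; the caller decides whether to tolerate the failed feed
                time.sleep(backoff * (attempt + 1))  # back off before retrying
```

A caller can catch the final `TransferTimeout` per feed, log it, and carry on with the remaining feeds if the warehouse can tolerate a missing one.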
Merge-on-Extract vs. Merge-on-Transform
If the systems are identical (e.g. POS systems in a chain of retail shops) you will probably get a simpler architecture by merging the data before the transformation phase. This means that the staging area will need to be aware of the data source, at least for audit purposes.
If you have a small number of systems or several heterogeneous sources, data merging should take place during the transformation process. In this situation your ETL will probably have separate routines for each of the source systems for at least some of the process.
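For the identical-systems case, tagging each staged row with its source (for the audit purposes mentioned above) might look like this sketch; the row format, feed structure, and function names are my own illustrations:

```python
def stage_rows(rows, source_id, extract_ts):
    """Tag each extracted row with its source system and extract timestamp
    before it lands in the common staging area (audit/lineage columns)."""
    return [dict(row, source_system=source_id, extracted_at=extract_ts)
            for row in rows]

def merge_identical_sources(feeds):
    """Merge feeds from identical systems (e.g. POS terminals) into one
    staging set; the transformation phase then runs once over the union.

    feeds maps a source id to a (rows, extract_timestamp) pair.
    """
    staged = []
    for source_id, (rows, ts) in feeds.items():
        staged.extend(stage_rows(rows, source_id, ts))
    return staged
```

With heterogeneous sources you would instead keep per-source routines and defer the merge to the transformation step, as described above.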
Do we need an ODS?
One of the great religious wars in data warehousing is whether one should have an ODS. I've done systems with and without ODS structures and in the individual cases there were reasons for the design decisions. My take on this is that there's no universal compelling argument on either side of this decision, which is the usual reason for the existence of religious wars in the first place.
My take on this from a 50,000 ft view is that the more source systems there are and the more heterogeneous the data is, the stronger the case for an ODS. One can draw a Gartner-style quadrant for this:
High+--------------------------+--------------------------+
    |                          |                          |
    | Kimball Model (low/high) | Enterprise Data Warehouse|
H   | Unified ODS model hard   | (high/high)              |
e   | to meaningfully design.  | ODS both difficult and   |
t   | Flat star schemas easier | probably necessary to    |
e   | to fit disparate         | make a manageable system.|
r   | data into.               | Better to separate trans-|
o   |                          | formation and history.   |
g   +--------------------------+--------------------------+
e   |                          |                          |
n   |                          | Consolidated Reporting   |
e   | Data Mart (low/low)      | (high/low)               |
i   |                          | ODS easy to implement    |
t   | ODS probably of          | and will simplify the    |
y   | little benefit           | overall system           |
    |                          | architecture             |
    |                          |                          |
Low +--------------------------+--------------------------+
    Low            Number of data sources            High