views:

31

answers:

1

We have several legacy & 3'd-party systems in organization that use several RDBMS vendors (& more specific data storages). Cross-system data reporting (as well as extra-reports that are not implemented in 3'd-party systems) is required with charts and population of templates (winword, excel). Reporting system is visioned as intranet web-site with custom user access to reports. We expect ~50 reports per day.

Would you suggest to use BizTalk or any other integration software if commercial-department doesn't plan to buy anything expensive.

Would you suggest to create centralized data storage for reporting that is populated regularly or rely on on-demand services that provides always up-to-request data. Centralized data storage will bring the ability to use standard tools such as MSSQL Reporting Services but templated reports would be custom coded with light-weight solution (as I suspect)

Thank you in advance!

A: 

To pick an ideal architecture, you'll need to examine some of the dynamics of your system. A few relevant questions:

  • How often does the source data change or update?
  • How "fresh" and real-time does the data have to be in the reports?
  • How often do you suspect that the source systems may change in the future?
  • How different are the source data structures from each other?
  • May there be other consumers besides the reporting system in the future?
  • In addition to schematic differences, are there semantic heterogeneities in the data?
  • How complex are the schemas?

With that in mind, let's examine the pros and cons of two data aggregation approaches:

Central Data Warehouse

  • Easy uniform schema for the reporting system and other consumers.
  • Hub-and-spoke topology means only one connector per source is required. If the source changes, there is only one place you need to fix the connection.
  • Data may not be fresh, as it relies on periodic synchronization with end systems.
  • If your data warehouse schema does not cover some future need, the hub-and-spoke topology means that you have to replace all the source system connectors.
  • The schema is rigidly defined, but an extensive system of validators is needed to enforce semantics.
  • You have an opportunity to perform data cleansing in one spot, correcting certain classes of dirty data known to you.

Point-to-Point Custom Connectors

  • As close to real-time data as possible.
  • All connectors are isolated from each other, and if a source changes then you need to change only one connector.
  • The uniformity of both schema and semantics may be implied in your connectors, but may not be rigidly enforced to the degree that a common database target would imply.
  • Changes to your reporting system or the addition of a new target may require you to rework all connectors.
  • The reporting system has to assume responsibility for any data cleansing necessary.
  • An ESB (e.g. Biztalk) may be a nice way to manage these connectors if they are message-oriented. It will add some overhead and expense, but you'll get reliability and a central broker to help you out. Depending on the size and expected growth of this aggregated system, and ESB may or may not represent a net reduction in complexity.

In both cases, I think the construction of the connectors can be accomplished with commercial products, open source products, or plain old code. There may be some extra bells and whistles (which may improve productivity) as you start to pay for products, but the major expense will be your engineers' time (both coding and analyzing). I would suggest:

  • If you are already fluent with a given tool for the connectors (e.g. ETL), go ahead and consider it. In particular, it'll cut a lot of the boilerplate.
  • If you don't have experience with these systems, think twice - you're masking the code in tools that might confuse more than help for a one-off project. But cutting the boilerplate and utilizing the structure forced on your may be a good thing over the long run.
  • Consider that the requirements WILL change someday. Make sure you've picked a technology that can be easily adapted and maintained.

There is of course no One Answer, but hopefully this helps you examine the right questions. I think the keys are managing complexity, and realizing that the overall aggregation network will change someday. It's just a matter of when.

Greg Harman
Thank you, lost of things to think about. What kind of ETL tools can you suggest? It there industry standard for this tools?
Andrew Florko
Greg Harman
Thank you once again
Andrew Florko