views:

83

answers:

2

I have a project to create a dashboard that will connect to existing systems as well as create new features based on combining data from the existing systems. For example, the dashboard will be able to generate "orders" containing data merged from "members" (MS Access DB), "employees" (MySQL DB) and "products" (flat file), and there will also be new attributes particular to "orders."

At first I thought it would be most efficient to have my application connect to each of the systems separately and perform cross-vendor joins between the different databases. But then I thought that creating a centralized/redundant db (built with scripts pushing and pulling data between the systems) might also be useful because it would empower some semi-technical staff to use products like OOBase, which can only make a single connection.

Are there any other advantages to creating a centralized/redundant DB like the one I'm talking about? Or are multiple direct connections the best approach?

Thanks in advance for any tips.

+1  A: 

Be very, very careful to copy lots of data around. If you do, here are some important guidelines:

  1. Make sure that one system is defined as the master and no other system may tamper with the data.

  2. Always copy data from the master to the slaves.

  3. When you copy the data, use a checksum of some kind to make sure all data has been copied. Make sure you can handle "yesterday, the copy failed".

  4. If a slave must make a change, push the change to the master and then use the standard "update" path to merge it back to the slave. Avoid "save change on slave and update the master some time in the future".

Aaron Digulla
Hi, Aaron. Thanks for your post. Do you think I need to read up on Inman and Kimball and understand dimensions and star schemas and EAV tables, etc., before I do all that? The warehouse seems like a lot of work, and I don't understand why it has to be different from a nicely normalized RDBMS schema.
Tony
Since I have no idea what you're talking about: No, you don't *have* to. You just need to understand that there are reasons why systems are the way they are and that data can't protect itself, so you must figure a way to do that.
Aaron Digulla
+2  A: 

To give you are short answer: yes, you want a central data storage.

You don't want to run complex reports on your live database. As your live database will grow you will want to do some housekeeping and clean it up but keep the data for analysi.

You will also want the data to be aggregated so you could perform historical analysis.

For the data which comes from different sources some clean-up will be required. And you will probably need to know how to link your data together and there are quire a lot of things like that you will have to be aware of to do the job properly.

You might consider reading on data warehousing (wikipedia) and business intelligence (wikipedia).

If you want to have 'new features' added to this system you could also look up orchestration (wikipedia). It will allow you to link your heterogeneous business processes together.

All of these are quite specialized and complex disciplines on their own so you might want to have a specialist to consult you.

Ilya Kochetov
Thanks for the feedback and the links, Ilya. I'm beginning to think a data warehouse is the way to go. However, I have never needed a consultant to help me build an information system. Can you recommend any books or tutorials? Is "The Data Warehouse Toolkit" still the standard? I would be writing the ETL in Python.
Tony
@Tony: of course if you have time and resources to learn something yourself and are not afraid of experimenting it's better to do it instead of getting a consultant. Please be aware that DW IS different from a normal RDBMS and could be tricky to do right from the first time.The book you suggested IS good, however quite a lot of other books are not so caveat emptor
Ilya Kochetov
After further research, I'm starting to think that the traditional "warehouse" approach (as defined by Inmon and Kimball) is going to lead me down the wrong path. I don't need a bunch of dimensional modeling and non-volatile data to provide complicated historical analysis for my business groups. I simply need to create a new operational system that manages records with keys in different operational systems. And I'm still not sure whether that would require an ELT layer - but if it did, then some tables would be redundant/read-only and others would be CRUD. I don't think that's a warehouse.
Tony
If it is an operational system you're after DW is not the way to go... a central information repository would still be very useful so you would not have to rely on any application to integrate the data for you. Using an orchestration server to load the data from various sources into a centralized storage might be the answer
Ilya Kochetov
I also suggest an interim storage step. And I'd pull a lot of ideas from data warehousing (like how to run ETL, how to standardize, NOT dimensional modeling or versioning) as well as from the ODS "Operational Data Store" (like how to handle near real-time feeds with light translation, etc ). Once you understand each, then you can pull appropriate ideas from both. A hybrid of the two for purposes like this isn't uncommon. But you should get someone with the right skills to give you feedback on your design.
KenFar