+4  A: 

This is definitely not a comprehensive reply. Sorry for the long post, and I hope it adds to the thoughts presented here.

I have a few observations on some of the aspects you mentioned.

duplicate data

It has been my experience that this is usually a side effect of departmentalization or specialization. A department pioneers the collection of certain data that other specialized groups then find useful. Since those groups don't have direct access to it (it is intermingled with the department's other data), they too start collecting and storing the data, inherently duplicating it. This issue never goes away, and just as there is a continuous effort to refactor code and remove duplication, there is a need to continuously bring duplicate data back into centralized access, storage, and modification.

well-defined interfaces

Most interfaces are defined with good intentions, with the constraints of the time in mind. However, we simply have a habit of outgrowing the constraints placed by previously defined interfaces. Again, a case for continuous refactoring.

tight coupling vs loose coupling

If anything, most software is plagued by this issue. Tight coupling is usually the result of an expedient solution, given the time constraints we face. Loose coupling incurs a certain degree of complexity, which we dislike when we just want to get things done. The web services mantra has been making the rounds for a number of years, and I have yet to see a good example of a solution that completely addresses the point.

architectural simplification

To me this is the key to fighting all the issues you have mentioned in your question. The SIP vs. H.323 VoIP story comes to mind. SIP is very simple and easy to build with, while H.323, like a typical telecom standard, tried to envisage every issue on the planet around VoIP and provide a solution for it. The end result: SIP grew much more quickly, while building an H.323-compliant solution is a pain. In fact, H.323 compliance is a mega-buck industry.

On a few architectural fads that I have grown up with:

Over the years, I have come to like the REST architecture for its simplicity. It provides simple, uniform access to data and makes it easy to build applications around it. I have seen enterprise solutions suffer more from duplication, isolation, and access to data than from any other issue, performance included. To me, REST is a remedy for some of those ills.
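
To make "simple, uniform access" concrete, here is a minimal, purely illustrative sketch (the URL and field names are made up): every consumer reads the same canonical resource instead of keeping its own copy.

    import json
    import urllib.request

    # Hypothetical canonical resource for customer data; any department that
    # needs it issues the same plain GET instead of maintaining a local copy.
    CUSTOMER_URL = "http://example.com/api/customers/{id}"

    def get_customer(customer_id):
        # Uniform access: the only thing a consumer needs to know is the
        # resource identifier, not the owning system's internals.
        with urllib.request.urlopen(CUSTOMER_URL.format(id=customer_id)) as resp:
            return json.load(resp)

    # customer = get_customer(42)   # e.g. {"id": 42, "name": "Acme Corp"}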

pyfunc
It absolutely adds thoughts to the discussion. Thanks, pyfunc. How would you go about refactoring to reduce data duplication across systems? How would you refactor interfaces across which data is shared to make them more well-defined?
jlpp
+5  A: 

I'm sure you saw this coming: "It Depends".

It depends on everything. And the solution for sharing Customer data with department A may be completely different from the solution for sharing Customer data with department B.

My favorite concept to rise up over the years is "Eventual Consistency". The term came from Amazon, talking about distributed systems.

The premise is that while the state of data across a distributed enterprise may not be perfectly consistent now, it "eventually" will be.

For example, when a customer record gets updated on system A, system B's customer data is now stale and no longer matches. But, "eventually", the record from A will be sent to B through some process. So, eventually, the two instances will match.
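
A toy sketch of that "some process" (purely illustrative; a real setup would use a message queue, replication feed, or ETL job rather than an in-process queue):

    import queue
    import threading
    import time

    changes = queue.Queue()   # stand-in for the A -> B transport
    system_a, system_b = {}, {}

    def update_customer_on_a(cust_id, record):
        system_a[cust_id] = record       # A is consistent immediately...
        changes.put((cust_id, record))   # ...and the change is queued for B

    def replicate_to_b():
        while True:
            cust_id, record = changes.get()
            system_b[cust_id] = record   # B catches up "eventually"

    threading.Thread(target=replicate_to_b, daemon=True).start()
    update_customer_on_a(42, {"name": "Acme Corp"})
    time.sleep(0.1)                              # B is stale only for this window
    print(system_a[42] == system_b.get(42))      # True, eventually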

When you work with a single system, you don't have "EC"; rather, you have instant updates, a single "source of truth", and, typically, a locking mechanism to handle race conditions and conflicts.

The more your operations are able to work with "EC" data, the easier it is to separate these systems. A simple example is a Data Warehouse used by sales. They use the DW to run their daily reports, but they don't run those reports until the early morning, and they always look at "yesterday's" (or earlier) data. So there's no real-time need for the DW to be perfectly consistent with the daily operations system. It's perfectly acceptable for a process to run at, say, close of business and move the day's transactions and activities over en masse in one large, single update operation.
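
In code, that close-of-business move could be as simple as one bulk copy. A sketch, with invented table and column names, using SQLite only to keep it self-contained:

    import sqlite3

    def nightly_load(ops_path="ops.db", dw_path="warehouse.db"):
        # Copy yesterday's transactions from the operations DB into the
        # warehouse in one en-masse update; reports never see "today".
        ops, dw = sqlite3.connect(ops_path), sqlite3.connect(dw_path)
        dw.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                      (txn_id INTEGER PRIMARY KEY, customer_id INTEGER,
                       amount REAL, txn_date TEXT)""")
        rows = ops.execute("""SELECT txn_id, customer_id, amount, txn_date
                              FROM transactions
                              WHERE txn_date = date('now', '-1 day')""").fetchall()
        dw.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?, ?, ?)", rows)
        dw.commit()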

You can see how this requirement can solve a lot of issues. There's no contention for the transactional data, no worry that some report's data is going to change in the middle of accumulating a statistic because the report made two separate queries against the live database. No need for high-detail chatter to suck up network and CPU capacity during the day.

Now, that's an extreme, simplified, and very coarse example of EC.

But consider a large system like Google. As consumers of Search, we have no idea when, or how long it takes for, a search result that Google harvests to show up on a search page. 1 ms? 1 s? 10 s? 10 hrs? It's easy to imagine that if you're hitting Google's West Coast servers, you may very well get a different search result than if you hit their East Coast servers. At no point are these two instances completely consistent. But by and large, they are mostly consistent. And for their use case, their consumers aren't really affected by the lag and delay.

Consider email. A wants to send a message to B, but in the process the message is routed through systems C, D, and E. Each system accepts the message, assumes complete responsibility for it, and then hands it off to another. The sender sees the email go on its way. The receiver doesn't really miss it because they don't necessarily know it's coming. So there is a big window of time that the message can take to move through the system without anyone concerned knowing or caring how fast it is.

On the other hand, A could have been on the phone with B. "I just sent it, did you get it yet? Now? Now? Get it now?"

Thus, there is some kind of underlying, implied level of performance and response. In the end, "eventually", A's outbox matches B's inbox.

These delays, and the acceptance of stale data, whether it's a day old or 1-5 s old, are what control the ultimate coupling of your systems. The looser this requirement, the looser the coupling, and the more flexibility you have at your disposal in terms of design.

This is true down to the cores in your CPU. Modern, multi-core, multi-threaded applications running on the same system can have different views of the "same" data, only microseconds out of date. If your code can work correctly with data that is potentially inconsistent, then happy day, it zips along. If not, you need to pay special attention to ensure your data is completely consistent, using techniques like volatile memory qualifiers, locking constructs, etc. All of which, in their way, cost performance.
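
A toy illustration of that trade-off (it shows atomicity rather than cache visibility, but the cost structure is the same): either tolerate the chance that concurrent updates interleave, or pay for a lock.

    import threading

    counter = 0
    lock = threading.Lock()

    def add_many(n, use_lock=True):
        global counter
        for _ in range(n):
            if use_lock:
                with lock:       # always consistent, but threads serialize
                    counter += 1
            else:
                counter += 1     # faster, but the read-modify-write can
                                 # interleave and updates may be lost

    threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)   # 200000 with the lock; possibly less without it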

So, this is the base consideration. All of the other decisions start here. Answering this can tell you how to partition applications across machines, what resources are shared, and how they are shared. What protocols and techniques are available to move the data, and how much it will cost in terms of processing to perform the transfer. Replication, load balancing, data shares, etc. etc. All based on this concept.

Edit, in response to the first comment.

Correct, exactly. The game here is, for example: if B can't change customer data, then what is the harm in customer data that has changed elsewhere? Can you "risk" it being out of date for a short time? Perhaps your customer data changes slowly enough that you can replicate it from A to B almost immediately. Say the change is put on a queue that, because of low volume, gets picked up readily (< 1 s); even then it would be "out of transaction" with the original change, so there's a small window where A has data that B does not.

Now the mind really starts spinning. What happens during that 1 s of "lag", and what's the worst possible scenario? And can you engineer around it? If you can engineer around a 1 s lag, you may be able to engineer around a 5 s, 1 min, or even longer lag. How much of the customer data do you actually use on B? Maybe B is a system designed to facilitate order picking from inventory. It's hard to imagine anything more being necessary than a Customer ID and perhaps a name, just something to grossly identify who the order is for while it's being assembled.

The picking system doesn't necessarily need to print out all of the customer information until the very end of the picking process, and by then the order may have moved on to another system, one that is perhaps more current with shipping information in particular, so in the end the picking system hardly needs any customer data at all. In fact, you could EMBED and denormalize the customer information within the picking order, so there's no need or expectation of synchronizing later. As long as the Customer ID (which will never change anyway) and the name (which changes so rarely it's not worth discussing) are correct, that's all the reference you need, and all of your pick slips are perfectly accurate at the time of creation.
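
Concretely, such a pick slip might carry nothing but the stable ID and a copy of the name taken at creation time (a sketch; the record shape is invented):

    # Hypothetical pick-slip record: the customer name is denormalized
    # (copied in) once, so the picking system never syncs with the
    # customer master afterwards.
    def create_pick_order(order_id, customer, lines):
        return {
            "order_id": order_id,
            "customer_id": customer["id"],      # stable, never changes
            "customer_name": customer["name"],  # accurate at time of creation
            "lines": lines,                     # e.g. [{"sku": "A-100", "qty": 3}]
        }

    order = create_pick_order(
        9001,
        {"id": 42, "name": "Acme Corp", "phone": "555-0100"},  # extras ignored
        [{"sku": "A-100", "qty": 3}],
    )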

The trick is the mindset: breaking the systems up and focusing on the essential data that's necessary for the task. Data you don't need doesn't need to be replicated or synchronized. Folks chafe at things like denormalization and data reduction, especially when they come from the relational data modeling world. And with good reason; it should be considered with caution. But once you go distributed, you have implicitly denormalized. Heck, you're copying the data wholesale now. So you may as well be smarter about it.

All this can be mitigated through solid procedures and a thorough understanding of workflow. Identify the risks and work up policies and procedures to handle them.

But the hard part is breaking the chain to the central DB at the beginning, and telling folks that they can't "have it all" like they may expect when there is a single, central, perfect store of information.

Will Hartung
Thanks Will. This is a very interesting and appropriate perspective to have when thinking about data sharing. Let's say Department B needs to read Department A's customer data in near real time. One option is to link the two databases with a DB link and query across it (slow joins). Another option is a DB link plus a copy of the required customer data in Department B's database, e.g. an auto-syncing materialized view. The first option has higher coupling but is real time. The second option is slightly less coupled but introduces data redundancy. How would you weigh these criteria? Other options?
jlpp
+1  A: 

To solve a number of those issues, I like the concept of central "Data Hubs". A Data Hub represents a "single source of truth" for a particular entity, but it only stores IDs, no information like names etc. In fact, it only stores ID maps - for example, mapping the Customer ID in system A to the Client Number from system B and to the Customer Number in system C. Interfaces between the systems use the hub to know how to relate information in one system to the other.

It's like a central translation layer; instead of having to write specific code for mapping A->B, A->C, and B->C, with its attendant combinatorial increase as you add more systems, you only need to convert to/from the hub: A->Hub, B->Hub, C->Hub, D->Hub, etc.
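
A rough sketch of what the hub might hold (IDs and system names invented): nothing but the ID map, with every translation going through it.

    # The hub stores only ID correspondences: no names, no addresses.
    HUB = {
        "CUST-0001": {"A": "A-12345", "B": "98765", "C": "C0042"},
    }

    def translate(from_system, local_id, to_system):
        # A relates its record to C's without any A->C specific code:
        # both sides only ever talk to the hub.
        for ids in HUB.values():
            if ids.get(from_system) == local_id:
                return ids.get(to_system)
        return None

    print(translate("A", "A-12345", "C"))   # -> "C0042"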

Jeffrey Kemp
Very interesting solution. The term "data hub" makes me think of Master Data Management, which may be of some use in sharing data, especially if the organization has data quality/sync issues. That would normally come up if multiple systems update (as opposed to just reading) the same type of information, each in their own database, and then this information has to sync. What mechanism would you generally use to read your data and some minimal business data from another database, using the mapping hub data? DB links?
jlpp
The mechanism will vary from place to place, I think. Personally I always say "the more you can get it all on one database instance, the better", but in practice you end up having to set up interfaces via database links, via a middle-tier solution of some sort (e.g. web services), or by replication.
Jeffrey Kemp