views:

322

answers:

1

There is frequently the need to synchronize data from master tables in one database to clone tables in other databases, often on other servers. For example, consider the case where a backend system manages inventory data and that inventory data ultimately must be pushed to one or more databases that are part of a web site application.

The source data in the backend system is heavily normalized, with dozens of tables and foreign key constraints. It is a well-designed OLTP RDBMS system. Many of the tables in question contain millions of rows. The need is to push this data out to the other databases regularly. As frequently as feasible; latency can be tolerated. Above all, maximum uptime of both the backend and remote databases is imperative.

I am using SQL Server and am familiar with change tracking, rowversion, triggers, and so on. I know that Microsoft pushes replication, SyncFx, and SSIS heavily for these scenarios. However, there is quite a difference between vendor whitepapers and overviews recommending technologies and the actual implementation, deployment, and maintenance of the solution. In the SQL Server world, replication is often viewed as the turnkey solution, but I am trying to explore alternate solutions. (There is some fear that replication is difficult to administer, makes it hard to change schema, and in the event that a re-initialize is ever required there would be large downtime for critical systems.)

There are lots of gotchas. Due to the complex foreign key relationships among large numbers of tables, determining what order to perform captures or to apply updates is not trivial. Due to unique indexes, two rows might be interlocked in such a way that row-at-a-time update will not even work (need to perform intermediate updates to each row before the final update). These are not necessarily show-stoppers, as unique indexes can often be changed to regular indexes and foreign keys can be disabled (though disabling the foreign keys is extremely undesirable). Often, you will hear, "just" use SQL 2008 change tracking and SSIS or SyncFx. These kinds of answers really do not do justice to the practical difficulties. (And of course, clients really have a hard time wrapping their heads over how copying data could be so difficult, making a difficult situation all the worse!)

This issue is ultimately very generic: perform one-way synchronization of many heavily related database tables with lots of rows. Almost everyone involved in databases has to deal with this kind of issue. Whitepapers are common, practical expertise hard to find. We know this can be a difficult issue, but the job must get done. Let's hear about what has worked for you (and what to avoid). Tell your experience with Microsoft products or products from other vendors. But if you personally have not battle-tested the solution with large numbers of heavily-related tables and rows, please refrain from answering. Let's keep this practical -- not theoretical.

+3  A: 

Better ask on serverfault.com (I can't post comments, scripts are broken in SO, so I have to post a full answer)

Update: (switched to Safari, scripts work again, I can post properly)

There is no silver bullet. For ease of use and 'one key turn' deployment nothing can beat replication. Is the only solution that covers deeply conflict detection and resolution, has support for pushing schema changes and comes with a comprehensive set of tools for setting it up and monitoring it. It has been the MS poster child of data synchronization for many years before this 'agenda' was taken over by the .Net crowd. Replication has two underlying problems in my opinion:

  • The technology used to pushing changes is primitive, slow and unreliable. It requires file shares to initiate the replicas and it depends on T-SQL to actually replicate data, resulting in all sort of scalability problems: the replication threads use server worker threads and the fact that they interact with arbitrary tables and application queries lead to blocking and deadlocks. The biggest deployments I've heard of are around 400-500 sites and are done by superhuman MVPs and top dollar consultants. This stops on its track many projects that start at 1500 sites (way beyond largest deployed replication projects). I'm curious to hear if I'm wrong and you know of a SQL Server replication solution deployed with more than 500 sites.
  • The replication metaphor is too data centric. It does not take into account the requirements of distributed applications: need of versioned and formalized contracts, autonomy of data 'fiefdoms', loose coupling from availability and security pov. As a result replication based solution solve the immediate need to 'make data available there', but fail to solve the true problem of 'my app needs to talk with your app'.

At the other end of the spectrum you'll find solutions that truly address the problem of application communication, like services based on queued messaging. But are either painfully slow and riddled with problems rooted in the separation of the communication mechanism (web services and or msmq) and the data storage (DTC transactions between comm and db, no common high availability story, no common recoverability story etc etc). Solutions that are blazingly fast and fully integrated with DB exists in the MS stack, but nobody knows how to use them. Somewhere in between these and replication you'll find various intermediate solutions, like OCS/Synch framework and SSIS based custom solutions. None will offer the ease of setup and monitoring of replication, but they might scale and perform better.

I was involved with several projects that required 'data synchronization' on a very large scale (+1200 sites, +1600 sites) and my solution was to turn the problem on a 'application communication' problem. Once the mindset is changed to this and the data flow is no longer seen as 'record with key X of table Y' but instead 'message communicating the purchase of item X by customer Y' the solution becomes easier to understand and apply. You no longer think in terms of 'insert records in order X-Y-Z so FK relations don't break' but instead in terms of 'process purchase as described by message XYZ'.

In my view replication, and it derivatives (ie. data tracking and data-gram shipping), are solutions anchored in the '80 technologies and view of the data/applications. Obsolete dinosaurs (and by no way turning into birds).

I know this does not even begin to address all your (very legit) concerns, but writing out all I have to say/rant/rable on this topic would fill volumes of paperback...

Remus Rusanu
Thanks, but I am viewing this from the standpoint of a database developer, not a server administrator. This is important from an upfront software design standpoint, not only an operations matter.
binarycoder
Thanks for your insight. For reference, note that the number of target sites that I am particularly concerned with is very small (1-3 databases) compared to the projects you've done.The intention is to run identical software logic on each node, thus the database schema of the tables in question will be the same. I understand what you say about "application communication", which is a must when disparate systems are involved, but a more general-purpose solution that requires little code by taking advantage of the schemas being the same is what I'm looking for.
binarycoder
You are describing replication. If it fits your needs, with all its gothcas, don't sweat on reinventing it. There are literally years of experience and feedback already accumulated in 'out-of-the-box' replication. The gotchas you see are what's *left* after many more problems were fixed, and you'll just have to overcome all those yourself.
Remus Rusanu