views: 43

answers: 3

Hi,

I have two heterogeneous databases, one in MySQL and one in MS SQL.

I want to keep them in sync.

Data will flow between them periodically, and the flow will be in both directions.

Does anyone have any strategies/approaches for this?

A: 

Anand, you can find this with a Google search:

sync mysql and mssql

I've not used this software, but they offer a free trial.

Steve De Caux
Yes, I checked that out, but I don't want a commercial solution. I want to develop one on my own; only in the worst case will I go for a commercial solution.
Anand
Cool bananas, I misunderstood your question.
Steve De Caux
+1  A: 

Assuming you aren't going to use some form of ready-made solution, you've got a few options open to you. Basically, what you're trying to do is find a way to capture the changes made in one database and replicate them in the other.

Full Extract and Delta
Take a complete dump of every row in the table(s) you want to sync, sorted by key, and compare it row by row against the dump from the last sync you ran. Having the output sorted makes the compare process a lot quicker, as you can figure out whether a row has been added, changed, or deleted in a single merge-style pass, without repeatedly searching either dump for matching keys.

  • Pro: Guaranteed to capture all changes.
  • Pro: Simple to implement.
  • Con: Slow.
  • Con: Will create a lot of load on the database when running.

This option should be quite viable for small to medium-sized databases; a minimal sketch of the compare step follows.
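
As a rough illustration of the compare step (my own sketch, not from the original answer), assume each dump is a CSV file sorted by its primary key, with the key in the first column; the file names are placeholders. A single merge-style pass over both files classifies every row:

    import csv

    def diff_sorted_dumps(prev_path, curr_path):
        # Single-pass merge compare of two key-sorted CSV dumps.
        # Yields ('insert' | 'update' | 'delete', row) tuples.
        with open(prev_path, newline="") as pf, open(curr_path, newline="") as cf:
            prev, curr = csv.reader(pf), csv.reader(cf)
            p, c = next(prev, None), next(curr, None)
            while p is not None or c is not None:
                if c is None or (p is not None and p[0] < c[0]):
                    yield ("delete", p)      # key vanished from the new dump
                    p = next(prev, None)
                elif p is None or c[0] < p[0]:
                    yield ("insert", c)      # key appears only in the new dump
                    c = next(curr, None)
                else:                        # same key in both dumps
                    if p != c:
                        yield ("update", c)  # non-key columns changed
                    p, c = next(prev, None), next(curr, None)

    # Each emitted change would then be translated into an
    # INSERT/UPDATE/DELETE against the other database.
    for op, row in diff_sorted_dumps("dump_last_sync.csv", "dump_now.csv"):
        print(op, row)

Sorting is what makes this cheap: neither file ever needs to be held in memory or searched repeatedly.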

Transaction Logs
Analyze the transaction logs from the database in order to find out what changed, and apply those changes to the other database.

  • Pro: Faster, as it doesn't need to read everything out of the database.
  • Pro: Still relatively simple to implement.
  • Con: DBAs sometimes have to purge logs to resolve production issues. This can result in missed changes that don't get synced.

Possibly a good idea if you can count on the logs being available.
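
On the MySQL side, the binary log can be streamed programmatically. The sketch below is an assumption-laden illustration, not part of the original answer: it uses the third-party python-mysql-replication package (pip install mysql-replication), requires the server to run with binlog_format=ROW, and the connection settings and apply_*_to_mssql helpers are hypothetical placeholders for the MS SQL writes.

    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import (
        WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
    )

    MYSQL = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

    # Stream row-level change events out of the MySQL binary log.
    stream = BinLogStreamReader(
        connection_settings=MYSQL,
        server_id=100,                # must be unique among replicas
        only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
        blocking=True,                # keep waiting for new events
        resume_stream=True,
    )

    for event in stream:
        for row in event.rows:
            if isinstance(event, WriteRowsEvent):
                apply_insert_to_mssql(event.table, row["values"])
            elif isinstance(event, UpdateRowsEvent):
                apply_update_to_mssql(event.table, row["before_values"],
                                      row["after_values"])
            elif isinstance(event, DeleteRowsEvent):
                apply_delete_to_mssql(event.table, row["values"])

Going the other way, the MS SQL side would mean reading its transaction log or, on newer versions, its Change Data Capture tables.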

Triggers
Use triggers to record the changes, and replicate them to the other database; a sketch follows the list below.

  • Pro: Fast, as it only captures the changes.
  • Con: Adds overhead to every transaction.
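
As an illustration (again my own sketch, not from the original answer), the MySQL side could look like the following: a generic change-capture table plus per-table triggers that record every insert, update, and delete. A separate sync job would then drain sync_changes and apply the rows to MS SQL. The customers table, its id column, and the connection details are all illustrative; it uses mysql-connector-python.

    import mysql.connector

    DDL = [
        """CREATE TABLE IF NOT EXISTS sync_changes (
               change_id  BIGINT AUTO_INCREMENT PRIMARY KEY,
               table_name VARCHAR(64) NOT NULL,
               row_pk     BIGINT      NOT NULL,
               op         CHAR(1)     NOT NULL,   -- 'I', 'U' or 'D'
               changed_at TIMESTAMP   DEFAULT CURRENT_TIMESTAMP
           )""",
        """CREATE TRIGGER customers_ai AFTER INSERT ON customers FOR EACH ROW
           INSERT INTO sync_changes (table_name, row_pk, op)
           VALUES ('customers', NEW.id, 'I')""",
        """CREATE TRIGGER customers_au AFTER UPDATE ON customers FOR EACH ROW
           INSERT INTO sync_changes (table_name, row_pk, op)
           VALUES ('customers', NEW.id, 'U')""",
        """CREATE TRIGGER customers_ad AFTER DELETE ON customers FOR EACH ROW
           INSERT INTO sync_changes (table_name, row_pk, op)
           VALUES ('customers', OLD.id, 'D')""",
    ]

    conn = mysql.connector.connect(host="127.0.0.1", user="sync",
                                   password="secret", database="app")
    cur = conn.cursor()
    for stmt in DDL:
        cur.execute(stmt)    # every write to customers now leaves a record
    conn.commit()

Note the extra INSERT that now runs inside every transaction on the table; that is exactly the overhead mentioned above.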

Synchronization in the Application
Simply make sure the application writes to both databases.

  • Pro: No real overhead for the database.
  • Con: Unreliable. All it takes is one person forgetting to write to both databases.

This could be made to work if the application writes to the database only through a few controlled modules (i.e. there aren't many places where someone could forget to update both databases); a sketch of that idea follows. In a less managed setup (i.e. multiple applications, uncontrolled or poorly factored database access, ad-hoc scripts) this simply isn't an option.
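
To make the controlled-module idea concrete, here is a hedged sketch: every write goes through one function that updates both databases, so there is exactly one place to get it right. The connection details and the customers table are illustrative; it uses mysql-connector-python and pyodbc.

    import mysql.connector
    import pyodbc

    mysql_conn = mysql.connector.connect(host="127.0.0.1", user="app",
                                         password="secret", database="app")
    mssql_conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=mssql-host;DATABASE=app;UID=app;PWD=secret"
    )

    def update_customer_email(customer_id, email):
        # Apply the same change to both databases, or roll back both.
        try:
            mysql_conn.cursor().execute(
                "UPDATE customers SET email = %s WHERE id = %s",
                (email, customer_id))
            mssql_conn.cursor().execute(
                "UPDATE customers SET email = ? WHERE id = ?",
                (email, customer_id))
            mysql_conn.commit()
            mssql_conn.commit()
        except Exception:
            mysql_conn.rollback()
            mssql_conn.rollback()
            raise

Bear in mind this is only best-effort pairing, not a distributed transaction: a crash between the two commits can still leave the databases out of step.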

Adam Luchjenbroers
Hi, of your suggestions I guess the third one makes the most sense to me. One more aspect of my problem is that the tables in the two DBs are different.
Anand
Well, there's not much of a way around that problem. You'll have to sit down and think through how data from one database maps onto the other. There's no simple answer there; it's all application-specific.
Adam Luchjenbroers
Yes, I agree that I have to do the mapping and that it is application-specific. Thanks for the suggestions; you have put things in better perspective for me.
Anand
A: 

First, more info is needed:

  • how much latency in the synchronization can you afford? 5 seconds? 5 hours? 15 hours? etc.
  • how much transformation between databases? Nothing or trivial changes that can be handled in a view, or something more substantial?
  • what do the data volumes look like? How large are the tables, how much data changes daily, and is the data partitioned?
  • what are the concurrency requirements? Can you shut off user access for a few minutes?
  • will bi-directional changes be occurring to the same table or to different tables?
  • what are the data quality requirements? is it OK if two different transactions update the same row of the same table, but in different databases, at about the same time, and one steps on the other?
  • what is the granularity? do you need to replicate each transaction that occurs or just the current snapshot at a point in time (even if that point in time is every 5 minutes)?

In general, if you need real-time synchronization then you end up with a replication solution. Such solutions can typically handle only a small amount of transformation (usually via stored procs) and usually work by sniffing logs. Since most people don't want a code dependency on log formats, they almost always go with a packaged, commercial solution.

If you don't need real-time synchronization, have vast data volumes, or have significant transformation requirements, then you end up with an ETL solution. There are quite a few to choose from, but they are mostly commercial. On the other hand, they aren't difficult to develop yourself if you take the time to understand best practices, which, oddly enough, really aren't talked about much. Anyhow, Adam Luchjenbroers did a good job identifying most of the ETL approaches. I recommend the file delta approach if you can afford to lose transactions between snapshots, since it is otherwise the most accurate: all of the others rely on timestamps, triggers, or logs, which do not capture every change.

KenFar
Hey dude, thanks for this detailed explanation. Well, I can afford a latency of, say, 15 minutes. How about having a job ping my table every 10 minutes, see if it has been updated based on the latest timestamp, and then carry out the synchronization?
Anand
If your timestamp is indexed (or your tables are small), and the amount of data that changes every 10 minutes is small enough, then that should work. You might want to do something special to detect deletes, like using a trigger to copy the deleted row (or just its PK) to a capture table. Loads will probably also be invisible to you, but they are less common.
KenFar
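
For concreteness, here is a minimal sketch of the polling job being discussed, under stated assumptions: the table has an indexed last_modified timestamp column, the customers table and its columns are illustrative, apply_row_to_mssql is a hypothetical placeholder for the MS SQL write, and the watermark would need to be persisted between runs. As noted, deletes won't show up this way; a trigger-fed capture table covers those.

    import datetime
    import time

    import mysql.connector

    conn = mysql.connector.connect(host="127.0.0.1", user="sync",
                                   password="secret", database="app")

    # High-water mark: in practice, persist this between runs.
    watermark = datetime.datetime(1970, 1, 1)

    while True:
        cur = conn.cursor()
        cur.execute(
            "SELECT id, email, last_modified FROM customers "
            "WHERE last_modified > %s ORDER BY last_modified",
            (watermark,))
        for row_id, email, modified in cur.fetchall():
            apply_row_to_mssql(row_id, email)   # hypothetical MS SQL upsert
            watermark = modified                # advance the high-water mark
        cur.close()
        time.sleep(600)                         # poll every 10 minutes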