Is there any way to get fault-tolerant MySQL replication? I am in an environment that has many networking issues. It appears that replication hits an error and just stops. I need it to keep working and recover from these faults. There is some wrapper software that checks the state of replication and restarts it if it loses its log position. Is there an alternative?

Note: Replication is done from an embedded computer with MySQL 4.1 to an external computer that has MySQL 5.0.45.

A: 

Consider MySQL Cluster using the NDB storage engine. It's meant to be shared-nothing and fault tolerant.

ʞɔıu
Need to stay with MyISAM or InnoDB.
Joshua
+1  A: 

What error are you getting? You also haven't said which replication scheme or MySQL version you're using, and both matter for diagnosing this.

Replication usually stops when there's a primary/unique key conflict in a Master-Master setup. Other than that, on a typical Master-Slave setup, networking issues shouldn't cause problems.

Try using MySQL 5.1 or newer, since replication in 5.0 is statement-based only, which causes problems in Master-Master setups or when you're using stored procedures.

(Also, stay away from MySQL Cluster ... I noticed the advice in another answer.)

Alexandru Nedelcu
Only used in a Master - Slave environment
Joshua
+1  A: 

MySQL replication will normally detect problems and reconnect anyway, continuing from where it left off.

If you're getting replication errors, it's likely that the source is something else. MySQL replication effectively does a "tail -f" on the master's binary log and replays it on the slave (it's slightly smarter than that, but not much).

If the databases become out of sync, MySQL replication will neither detect nor repair this, but it may eventually break when a subsequent update cannot proceed because of conflicting data on the slave.

The default timeouts on the replication slave are much too long - slave_net_timeout defaults to 3600 seconds, so the slave can wait an hour before it notices a dead connection - you'll want to reduce this.
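
As a rough illustration of lowering that timeout, here is a small Python sketch that only builds the SQL to send to the slave. The helper name timeout_statements and the 60-second value are made up for the example, and the STOP SLAVE / START SLAVE pair assumes the running replication threads would not otherwise pick up the new value. To make the setting survive a restart, slave_net_timeout can also be set under [mysqld] in my.cnf.

```python
def timeout_statements(net_timeout_s=60):
    """Return SQL statements that shorten the slave's network timeout.

    Hypothetical helper: it only builds the statements, it does not
    connect to MySQL. Run the result with whatever client you already use.
    """
    if net_timeout_s <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [
        "STOP SLAVE",
        # slave_net_timeout is the real server variable; 60 is an assumed value
        "SET GLOBAL slave_net_timeout = %d" % net_timeout_s,
        "START SLAVE",
    ]


for stmt in timeout_statements(60):
    print(stmt + ";")
```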

Data becoming out of sync is difficult to avoid. Mitigation steps are:

  • Monitor replication using something like mk-table-checksum from Maatkit
  • Audit all your code for replication-unsafe queries
  • If using 5.1, switch to row-based replication, which is less likely to suffer from this problem
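
For the code audit step, a crude text scan can give a first pass. The sketch below is only a heuristic under stated assumptions: the function name unsafe_reasons and the short list of patterns are mine, the list is incomplete, and a plain regex will also flag matches inside string literals. It checks for a few functions that statement-based replication can evaluate differently on the slave (UUID(), SYSDATE(), LOAD_FILE()) and for UPDATE/DELETE with LIMIT but no ORDER BY.

```python
import re

# Assumed, incomplete list of functions whose result can differ on the slave
# under statement-based replication.
UNSAFE_FUNCTIONS = ("UUID", "SYSDATE", "LOAD_FILE")


def unsafe_reasons(sql):
    """Return a list of reasons why a statement looks replication-unsafe."""
    reasons = []
    upper = sql.upper()

    # Flag calls to the functions listed above.
    for name in UNSAFE_FUNCTIONS:
        if re.search(r"\b%s\s*\(" % name, upper):
            reasons.append("calls %s()" % name)

    # UPDATE/DELETE ... LIMIT without ORDER BY may touch different rows
    # on the master and on the slave.
    is_write = re.match(r"\s*(UPDATE|DELETE)\b", upper) is not None
    has_limit = re.search(r"\bLIMIT\b", upper) is not None
    has_order = re.search(r"\bORDER\s+BY\b", upper) is not None
    if is_write and has_limit and not has_order:
        reasons.append("LIMIT without ORDER BY")

    return reasons


for query in ("INSERT INTO t VALUES (UUID())", "DELETE FROM t LIMIT 10"):
    print(query, "->", unsafe_reasons(query))
```

Anything it flags still needs a human look; anything it misses is not proven safe.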
MarkR
+1  A: 

Replication errors only happen if the databases get out of sync somehow. Having the server simply continue would mean incoherent databases, and I really doubt you'd want that.

In my experience, the only time you end up with such errors is if one of the master servers did not complete a query and the slave noticed.

In any case, if you really want the slave to continue automatically, you could have a cron job run "SHOW SLAVE STATUS" on the slave every few minutes and check the Last_Error column; if an error is present, send "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;". Note that this skips the failing event rather than repairing it. It would probably be much more apt to email an admin when MySQL encounters an error instead, so he/she can investigate the source of the problem and make sure the databases are actually in sync; otherwise you're likely to see more errors in the near future as the databases drift further apart.
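
The decision part of such a cron job could look roughly like the Python sketch below. It is a minimal sketch, not a tested tool: plan_recovery and allow_skip are hypothetical names, and the function expects one row of SHOW SLAVE STATUS already fetched into a dict by whatever MySQL client you use. It relies on the Slave_IO_Running, Slave_SQL_Running and Last_Errno columns, and skipping is off by default so the normal outcome of an SQL error is an alert.

```python
def plan_recovery(status, allow_skip=False):
    """Decide what to do for one row of SHOW SLAVE STATUS (given as a dict).

    Returns (action, statements), where statements is the SQL to send.
    """
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    errno = int(status.get("Last_Errno") or 0)

    if io_ok and sql_ok:
        return "healthy", []
    if errno == 0:
        # A thread is stopped but no SQL error was recorded: just restart.
        return "restart", ["START SLAVE"]
    if allow_skip:
        # Skips the failing event; the data may now differ from the master.
        return "skip", [
            "STOP SLAVE",
            "SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1",
            "START SLAVE",
        ]
    # An SQL error with skipping disabled: leave it for a human to look at.
    return "alert", []


# Example row: SQL thread stopped on error 1062 (duplicate key).
row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No", "Last_Errno": "1062"}
print(plan_recovery(row))
print(plan_recovery(row, allow_skip=True))
```

Keeping the decision separate from the database connection makes it easy to log every skipped error, which helps when you later need to check whether the two databases still match.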

Ian
This is an embedded environment that needs to just repair itself.
Joshua