I have two SQL Server 2005 instances that are geographically separated. Important databases are replicated from the primary location to the secondary using transactional replication.

I'm looking for a way that I can monitor this replication and be alerted immediately if it fails.

We've had occasions in the past where the network connection between the two instances has gone down for a period of time. Because replication couldn't occur and we didn't know about it, the transaction log blew out and filled the disk, causing an outage on the primary database as well.

Some Google searching a while back led us to monitor the MSrepl_errors table and alert when any entries appeared, but this simply doesn't work. The last time replication failed (last night, hence the question), errors only hit that table once replication was restarted.

Does anyone else monitor replication and how do you do it?


Just a little bit of extra information:

It seems that last night the problem was that the Log Reader Agent died and didn't start up again. I believe this agent is responsible for reading the transaction log and putting records in the distribution database so they can be replicated on the secondary site.

As this agent runs inside SQL Server, we can't simply make sure a process is running in Windows.
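For reference, it looks like the Log Reader Agent's activity can be checked in the distribution database itself. A rough sketch (assuming the default distribution database name; the 15-minute threshold and the idea of alerting on any row returned are just illustrative):

-- Rough sketch: flag a Log Reader Agent that hasn't logged any history recently.
-- Assumes the distribution database uses its default name; tune the threshold as needed.
USE distribution;

SELECT a.name,
       MAX(h.[time]) AS last_activity
FROM   dbo.MSlogreader_agents  AS a
JOIN   dbo.MSlogreader_history AS h ON h.agent_id = a.id
GROUP  BY a.name
HAVING DATEDIFF(MINUTE, MAX(h.[time]), GETDATE()) > 15;  -- any row returned = a stale agent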

+1  A: 

We have emails sent to us for Merge Replication failures. I have not used Transactional Replication but I imagine you can set up similar alerts.

The easiest way is to set it up through Replication Monitor.

Go to Replication Monitor and select a particular publication. Then select the Warnings and Agents tab and configure the alert you want to use. In our case it is Replication: Agent Failure.

For this alert, we have the Response set up to Execute a Job that sends an email. The job can also do some work to include details of what failed, etc.

This works well enough for alerting us to the problem so that we can fix it right away.
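If you would rather script it than click through the UI, the same sort of alert can be managed from msdb. A sketch only: it assumes the predefined 'Replication: agent failure' alert already exists (it is normally created, disabled, when the Distributor is set up), and the job and operator names here are made up.

-- Sketch: enable the predefined agent-failure alert and point it at a notification job.
-- 'Notify DBAs of replication failure' and 'DBA Team' are hypothetical names.
USE msdb;

EXEC dbo.sp_update_alert
     @name     = N'Replication: agent failure',
     @enabled  = 1,
     @job_name = N'Notify DBAs of replication failure';

-- Optionally also email an operator directly when the alert fires:
EXEC dbo.sp_add_notification
     @alert_name          = N'Replication: agent failure',
     @operator_name       = N'DBA Team',
     @notification_method = 1;   -- 1 = email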

Jeff Widmer
Do you have any information on what actually constitutes a failure? Often when backups are running at night, the limited bandwidth stops replication from keeping up straight away. That's fine as long as it catches up, so we don't need an alert for that. I guess it's a balancing act...
Damovisa
To be honest, we have not figured this out completely either. There are times an alert will trigger but replication eventually completes successfully, and we mostly get false alerts while our snapshots are being generated (we do this weekly). When a particular merge agent runs and fails, the alert is genuine and makes sense (usually 'unable to connect to the publisher' type problems).
Jeff Widmer
+1  A: 

You could run a regular check that data changes are taking place, though this could be complex depending on your application.

If you have some form of audit trail table that is very regularly updated (e.g. our main product has a base audit table that lists all actions that result in data being updated or deleted) then you could query that table on both servers and make sure the result you get back is the same. Something like:

SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM   audit_base
WHERE  action_timestamp BETWEEN <time1> AND <time2>

where <time1> and <time2> are round values to allow for different delays in contacting the two databases. For instance, if you are checking at ten past the hour you might check items from the start of the last hour to the start of this hour. You now have two small values that you can transmit somewhere and compare. If they are different then something has most likely gone wrong in the replication process - have whatever process does the check/comparison send you an email and an SMS so you know to check and fix any problem that needs attention.
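For example, if the check runs at ten past each hour, it could compare the previous whole hour on both servers. A sketch against the audit_base table above:

-- Sketch: checksum the previous whole hour of audit rows; run the same batch on
-- both servers and compare the two numbers that come back.
DECLARE @this_hour datetime, @last_hour datetime;
SET @this_hour = DATEADD(HOUR, DATEDIFF(HOUR, 0, GETDATE()), 0);  -- start of current hour
SET @last_hour = DATEADD(HOUR, -1, @this_hour);                   -- start of previous hour

SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM   audit_base
WHERE  action_timestamp >= @last_hour
AND    action_timestamp <  @this_hour;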

By using CHECKSUM_AGG the result returned for each table is very small, so the bandwidth used by the checks will be insignificant. You just need to make sure your checks are not too expensive in the load they apply to the servers, and that you don't check data that might be part of open replication transactions and so could legitimately differ at that moment (hence checking the audit trail a few minutes back in time instead of right now in my example); otherwise you'll get too many false alarms.

Depending on your database structure the above might be impractical. For tables that are not insert-only within the timeframe of your check (unlike the audit trail above, which sees no updates or deletes in that window), working out what can safely be compared while avoiding false alarms is likely to be both complex and expensive, if not actually impossible to do reliably.

You could manufacture a rolling insert-only table if you do not already have one, by having a small table (containing just an indexed timestamp column) to which you add one row regularly - this data serves no purpose other than to exist so you can check that changes to the table are getting replicated. You can delete data older than your checking window, so the table shouldn't grow large. Testing one table does not prove that all the other tables are replicating, but finding an error in this one table would be a good "canary" check (if this table isn't updating in the replica, then the others probably aren't either).
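A minimal version of such a table might look like this (all names are illustrative; the insert and cleanup would run from a scheduled job on the publisher, and the final query is the check you run against the subscriber):

-- Sketch: a tiny insert-only "canary" table whose only job is to prove replication is moving data.
CREATE TABLE dbo.repl_heartbeat
(
    beat_time datetime NOT NULL CONSTRAINT DF_repl_heartbeat DEFAULT (GETDATE()),
    CONSTRAINT PK_repl_heartbeat PRIMARY KEY CLUSTERED (beat_time)
);

-- Run regularly on the publisher (e.g. every minute from a SQL Agent job):
INSERT INTO dbo.repl_heartbeat DEFAULT VALUES;
DELETE FROM dbo.repl_heartbeat WHERE beat_time < DATEADD(DAY, -1, GETDATE());

-- Run on the subscriber: alert if the newest replicated row is too old.
SELECT DATEDIFF(MINUTE, MAX(beat_time), GETDATE()) AS minutes_behind
FROM   dbo.repl_heartbeat;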

This sort of check has the advantage of being independent of the replication process - you are not waiting for the replication process to record exceptions in logs, you are instead proactively testing some of the actual data.

David Spillett
Thanks David, that's a good answer. We have actually looked at this before and had put it to one side while looking for an easier option. It may be something we'll have to look at now though.
Damovisa