Hi. I'm hoping someone can help me with a WAL-shipping and warm standby issue. My standby system runs happily for weeks, then all of a sudden it starts looking for .history files that don't exist. It then craps out and I can't successfully restart it without rebuilding the standby.

Both systems run CentOS 4.5 and Postgres 8.4.1. WAL files from production are stored on the standby via NFS.
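For anyone wondering what the shipping side looks like: the primary archives each completed segment into that NFS-mounted directory with a plain cp-based archive_command, roughly like this (a simplified sketch, not my exact postgresql.conf entries):

    # postgresql.conf on the primary (sketch; the real entries may differ)
    archive_mode = on
    archive_command = 'test ! -f /var/tafkan_backup_from_db1/%f && cp %p /var/tafkan_backup_from_db1/%f'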

A relevant chunk of the log, with my comments:

[** Recovery is running normally **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001000000830000005B
WAL file path           : /var/tafkan_backup_from_db1/00000001000000830000005B
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 00000001000000830000004D and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
running restore         : OK

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001000000830000005B
WAL file path           : /var/tafkan_backup_from_db1/00000001000000830000005B
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 000000000000000000000000 and later
running restore         : OK

[** All of a sudden it starts looking for .history files **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000002.history
WAL file path           : /var/tafkan_backup_from_db1/00000002.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
not restored
history file not found
Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001.history
WAL file path           : /var/tafkan_backup_from_db1/00000001.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
not restored
history file not found

[** I stopped Postgres, renamed recovery.done to recovery.conf, and restarted it. **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000002.history
WAL file path           : /var/tafkan_backup_from_db1/00000002.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
not restored
history file not found
Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 0000000200000083000000A2
WAL file path           : /var/tafkan_backup_from_db1/0000000200000083000000A2
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/0000000200000083000000A2" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 000000000000000000000000 and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...

[** This file is not present. All WAL files start with 00000001. **] 

Any ideas? I don't even know what .history files are, and the (mostly excellent) documentation is not very clear on any of this.

PS. I wish I were running VMs so I could use hypervisor-level failover and not have to worry about any of this application-level HA nonsense :-)

Update: Here are some of the logs from the standby server at about this time. It looks like something made the server stop recovering and come online, but I don't know what. I'm pretty certain that nothing could have created the trigger file.

2010-01-20 03:30:15 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005A" from archive
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005B" from archive
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  record with zero length at 83/5BFA2FF8
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  redo done at 83/5BFA2FAC
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  last completed transaction was at log time 2010-01-20 03:28:04.594399-05
2010-01-20 03:30:25 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005B" from archive
2010-01-20 03:30:37 EST 4b3a5c63.401b LOG:  selected new timeline ID: 2
2010-01-20 03:30:49 EST 4b3a5c63.401b LOG:  archive recovery complete
2010-01-20 03:30:59 EST 4b3a5c62.4019 LOG:  database system is ready to accept connections
+1  A: 

A totally different approach for HA might be to host the PG database on a DRBD device shared between the two machines.
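A minimal DRBD resource definition for that kind of setup might look something like the sketch below (hostnames, devices and addresses are placeholders, not taken from your environment):

    # /etc/drbd.conf (sketch only - adjust hosts, backing devices and addresses)
    resource pgdata {
      protocol C;                    # synchronous replication
      on db1 {
        device    /dev/drbd0;
        disk      /dev/sda5;         # backing block device holding $PGDATA
        address   10.0.0.1:7789;
        meta-disk internal;
      }
      on db2 {
        device    /dev/drbd0;
        disk      /dev/sda5;
        address   10.0.0.2:7789;
        meta-disk internal;
      }
    }

Postgres then runs only on whichever node currently holds the DRBD device in primary mode.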

Bandi-T
Thanks for the suggestion! That's probably what I'll do if I can't get WAL-shipping working reliably.
sbleon
+1  A: 

Are you using your own recovery script/program? If so, please don't - use pg_standby from PostgreSQL contrib.

Otherwise - just ignore the .history files.
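For reference, a pg_standby-based recovery.conf is essentially a one-liner, roughly like this (the archive directory and trigger file are the ones visible in your log output; the flags are illustrative):

    # recovery.conf on the standby (sketch)
    restore_command = 'pg_standby -d -s 2 -t /tmp/pgsql.trigger /var/tafkan_backup_from_db1 %f %p %r 2>> standby.log'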

depesz
I'm using pg_standby. recovery.conf contains: "restore_command = 'pg_standby -l -d -s 2 -t /tmp/pgsql.trigger /var/tafkan_backup_from_db1 %f %p %r 2>>standby.log'". I can't ignore the .history files because when pg_standby starts looking for them, the recovery fails, recovery.conf gets moved to recovery.done, and the WAL files start rapidly piling up.
sbleon
+1  A: 

Your replicated copy came online at some point. The request for "00000002.history" is looking for the history file for timeline 00000002, whereas the rest of your logs start with 00000001, which is the original timeline.

I would check your PostgreSQL logs right before it started looking for the history file to see if there's any indication the DB came online, even for a moment.
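Something along these lines on the standby should narrow it down (the log directory and filename pattern are guesses for a stock CentOS install - adjust to wherever your server log actually lives):

    # look for recovery/timeline/trigger messages around the incident
    grep -i -E 'recovery|timeline|trigger' /var/lib/pgsql/data/pg_log/*.log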

Matthew Wood
Thanks, Matthew. I added some of the logs to my question. You're right that something made it come online, but I can't imagine what, or why.
sbleon
Did something happen on the source side? The entry "record with zero length at 83/5BFA2FF8" looks like it tried to restore only a partial WAL log. IIRC, when it runs into an invalid record in the WAL, it rolls back to the last *good* record in that WAL and then comes online, regardless of the existence of a trigger file. I would look at both systems' logs around 2010-01-20 03:28:04.594399-05 and see if there were any errors in Postgres, the OS, or the network.
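A quick way to test the partial-copy theory (the path is the one from your pg_standby output; 16 MB is the default WAL segment size):

    # a healthy archived segment should be exactly 16 MB
    ls -l /var/tafkan_backup_from_db1/00000001000000830000005B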
Matthew Wood
That behavior makes sense. If the standby sees something that looks like a failure of the primary, it assumes the primary has died and that it should pick up the slack. I suspect there may be a networking issue here; I'm going to look into that angle. Thanks!
sbleon