Hi. I'm hoping someone can help me with a WAL-shipping and warm standby issue. My standby system runs happily for weeks, then all of a sudden it starts looking for .history files that don't exist. It then craps out and I can't successfully restart it without rebuilding the standby.

Both systems run CentOS 4.5 and Postgres 8.4.1. WAL files from production are stored on the standby via NFS.
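For anyone wondering what the shipping side looks like: the primary archives each completed segment into that NFS-mounted directory with a plain cp-based archive_command, roughly like this (a simplified sketch, not my exact postgresql.conf entries):

    # postgresql.conf on the primary (sketch; the real entries may differ)
    archive_mode = on
    archive_command = 'test ! -f /var/tafkan_backup_from_db1/%f && cp %p /var/tafkan_backup_from_db1/%f'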

A relevant chunk of the log, with my comments:

[** Recovery is running normally **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001000000830000005B
WAL file path           : /var/tafkan_backup_from_db1/00000001000000830000005B
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 00000001000000830000004D and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
running restore         : OK

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001000000830000005B
WAL file path           : /var/tafkan_backup_from_db1/00000001000000830000005B
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 000000000000000000000000 and later
running restore         : OK

[** All of a sudden it starts looking for .history files **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000002.history
WAL file path           : /var/tafkan_backup_from_db1/00000002.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
not restored
history file not found
Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001.history
WAL file path           : /var/tafkan_backup_from_db1/00000001.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
not restored
history file not found

[** I stopped Postgres, renamed recovery.done to recovery.conf, and restarted it. **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000002.history
WAL file path           : /var/tafkan_backup_from_db1/00000002.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
not restored
history file not found
Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 0000000200000083000000A2
WAL file path           : /var/tafkan_backup_from_db1/0000000200000083000000A2
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/0000000200000083000000A2" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 000000000000000000000000 and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...

[** This file is not present. All WAL files start with 00000001. **] 

Any ideas? I don't even know what .history files are, and the (mostly excellent) documentation is not very clear on any of this.

PS. I wish I were running VMs so I could use hypervisor-level failover and not have to worry about any of this application-level HA nonsense :-)

Update: Here are some of the logs from the standby server at about this time. It looks like something made the server stop recovering and come online, but I don't know what. I'm pretty certain that nothing could have created the trigger file.

2010-01-20 03:30:15 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005A" from archive
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005B" from archive
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  record with zero length at 83/5BFA2FF8
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  redo done at 83/5BFA2FAC
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  last completed transaction was at log time 2010-01-20 03:28:04.594399-05
2010-01-20 03:30:25 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005B" from archive
2010-01-20 03:30:37 EST 4b3a5c63.401b LOG:  selected new timeline ID: 2
2010-01-20 03:30:49 EST 4b3a5c63.401b LOG:  archive recovery complete
2010-01-20 03:30:59 EST 4b3a5c62.4019 LOG:  database system is ready to accept connections
+1  A: 

A totally different approach for HA might be to host the PG database on a DRBD device shared between the two machines.
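A minimal DRBD resource definition for that kind of setup might look something like the sketch below (hostnames, devices and addresses are placeholders, not taken from your environment):

    # /etc/drbd.conf (sketch only - adjust hosts, backing devices and addresses)
    resource pgdata {
      protocol C;                    # synchronous replication
      on db1 {
        device    /dev/drbd0;
        disk      /dev/sda5;         # backing block device holding $PGDATA
        address   10.0.0.1:7789;
        meta-disk internal;
      }
      on db2 {
        device    /dev/drbd0;
        disk      /dev/sda5;
        address   10.0.0.2:7789;
        meta-disk internal;
      }
    }

Postgres then runs only on whichever node currently holds the DRBD device in primary mode.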

Bandi-T
Thanks for the suggestion! That's probably what I'll do if I can't get WAL-shipping working reliably.
sbleon
+1  A: 

Are you using your own recovery script/program? If so, please don't - use pg_standby from PostgreSQL contrib.

Otherwise - just ignore the .history files.
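For reference, a pg_standby-based recovery.conf is essentially a one-liner, roughly like this (the archive directory and trigger file are the ones visible in your log output; the flags are illustrative):

    # recovery.conf on the standby (sketch)
    restore_command = 'pg_standby -d -s 2 -t /tmp/pgsql.trigger /var/tafkan_backup_from_db1 %f %p %r 2>> standby.log'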

depesz
I'm using pg_standby. recovery.conf contains: "restore_command = 'pg_standby -l -d -s 2 -t /tmp/pgsql.trigger /var/tafkan_backup_from_db1 %f %p %r 2>>standby.log'". I can't ignore the .history files because when pg_standby starts looking for them, the recovery fails, recovery.conf gets moved to recovery.done, and the WAL files start rapidly piling up.
sbleon
+1  A: 

Your replicated copy came online at some point. The request for "00000002.history" is looking for the history file for timeline 00000002, whereas the rest of your logs start with 00000001, which is the original timeline.

I would check your PostgreSQL logs right before it started looking for the history file to see if there's any indication the DB came online, even for a moment.
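Something along these lines on the standby should narrow it down (the log directory and filename pattern are guesses for a stock CentOS install - adjust to wherever your server log actually lives):

    # look for recovery/timeline/trigger messages around the incident
    grep -i -E 'recovery|timeline|trigger' /var/lib/pgsql/data/pg_log/*.log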

Matthew Wood
Thanks, Matthew. I added some of the logs to my question. You're right that something made it come online, but I can't imagine what, or why.
sbleon
Did something happen on the source side? The entry "record with zero length at 83/5BFA2FF8" looks like it tried to restore only a partial WAL log. IIRC, when it runs into an invalid record in the WAL, it rolls back to the last *good* record in that WAL and then comes online, regardless of the existence of a trigger file. I would look at both systems' logs around 2010-01-20 03:28:04.594399-05 and see if there were any errors in Postgres, the OS, or the network.
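A quick way to test the partial-copy theory (the path is the one from your pg_standby output; 16 MB is the default WAL segment size):

    # a healthy archived segment should be exactly 16 MB
    ls -l /var/tafkan_backup_from_db1/00000001000000830000005B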
Matthew Wood
That behavior makes sense. If the standby sees something that looks like a failure of the primary, it assumes the primary has died and that it should pick up the slack. I suspect there may be a networking issue here; I'm going to look into that angle. Thanks!
sbleon