So we have a production database that is 32GB on a machine with 16GB of RAM. Thanks to caching this is usually not a problem at all. But whenever I start a pg_dump of the database, queries from the app servers start queueing up, and after a few minutes the queue runs away and our app grinds to a halt.

I'll be the first to acknowledge that we have query performance issues, and we're addressing those. Meanwhile, I want to be able to run pg_dump nightly, in a way that sips from the database and doesn't take our app down. I don't care if it takes hours. Our app doesn't run any DDL, so I'm not worried about lock contention.

In an attempt to fix the problem, I'm running pg_dump under both nice and ionice:

nice ionice -c2 -n7 pg_dump -Fc production_db -f production_db.sql

Even with nice and ionice I still see the issue described above. It appears that I/O wait and lots of seeks are causing the problem.

vmstat 1 

It shows that iowait hovers around 20-25% and sometimes spikes to 40%. Real CPU usage fluctuates between 2% and 5%, with occasional spikes to 70%.
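
For a per-device breakdown of the same symptom, something like this (assuming the sysstat package's iostat is available) should show it via the %util and await columns:

iostat -x 1   # extended per-device stats; watch %util and await on the data disk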

I don't believe locks are the culprit. When I run this query:

select pg_class.relname, pg_locks.* from pg_class, pg_locks where pg_class.relfilenode = pg_locks.relation;

I only see locks which are marked granted = 't'. We don't typically run any DDL in production -- so locks don't seem to be the issue.

Here is output from a ps with the WCHAN column enabled:

PID WCHAN              S TTY          TIME COMMAND
3901 sync_page         D ?        00:00:50 postgres: [local] COPY
3916 -                 S ?        00:00:01 postgres:  SELECT
3918 sync_page         D ?        00:00:07 postgres:  INSERT
3919 semtimedop        S ?        00:00:04 postgres:  SELECT
3922 -                 S ?        00:00:01 postgres:  SELECT
3923 -                 S ?        00:00:01 postgres:  SELECT
3924 -                 S ?        00:00:00 postgres:  SELECT
3927 -                 S ?        00:00:06 postgres:  SELECT
3928 -                 S ?        00:00:06 postgres:  SELECT
3929 -                 S ?        00:00:00 postgres:  SELECT
3930 -                 S ?        00:00:00 postgres:  SELECT
3931 -                 S ?        00:00:00 postgres:  SELECT
3933 -                 S ?        00:00:00 postgres:  SELECT
3934 -                 S ?        00:00:02 postgres:  SELECT
3935 semtimedop        S ?        00:00:13 postgres:  UPDATE waiting
3936 -                 R ?        00:00:12 postgres:  SELECT
3937 -                 S ?        00:00:01 postgres:  SELECT
3938 sync_page         D ?        00:00:07 postgres:  SELECT
3940 -                 S ?        00:00:07 postgres:  SELECT
3943 semtimedop        S ?        00:00:04 postgres:  UPDATE waiting
3944 -                 S ?        00:00:05 postgres:  SELECT
3948 sync_page         D ?        00:00:05 postgres:  SELECT
3950 sync_page         D ?        00:00:03 postgres:  SELECT
3952 sync_page         D ?        00:00:15 postgres:  SELECT
3964 log_wait_commit   D ?        00:00:04 postgres:  COMMIT
3965 -                 S ?        00:00:03 postgres:  SELECT
3966 -                 S ?        00:00:02 postgres:  SELECT
3967 sync_page         D ?        00:00:01 postgres:  SELECT
3970 -                 S ?        00:00:00 postgres:  SELECT
3971 -                 S ?        00:00:01 postgres:  SELECT
3974 sync_page         D ?        00:00:00 postgres:  SELECT
3975 -                 S ?        00:00:00 postgres:  UPDATE
3977 -                 S ?        00:00:00 postgres:  INSERT
3978 semtimedop        S ?        00:00:00 postgres:  UPDATE waiting
3981 semtimedop        S ?        00:00:01 postgres:  SELECT
3982 -                 S ?        00:00:00 postgres:  SELECT
3983 semtimedop        S ?        00:00:02 postgres:  UPDATE waiting
3984 -                 S ?        00:00:04 postgres:  SELECT
3986 sync_buffer       D ?        00:00:00 postgres:  SELECT
3988 -                 R ?        00:00:01 postgres:  SELECT
3989 -                 S ?        00:00:00 postgres:  SELECT
3990 -                 R ?        00:00:00 postgres:  SELECT
3992 -                 R ?        00:00:01 postgres:  SELECT
3993 sync_page         D ?        00:00:01 postgres:  SELECT
3994 sync_page         D ?        00:00:00 postgres:  SELECT
A: 

There's an app for that...

The easiest:
You can rate-limit pg_dump's output using the throttle utility.
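
For example, something along these lines (pg_dump writes to stdout here, and the -m 1 is an assumed megabytes-per-second flag; check throttle(1) on your system for the exact option letters):

nice ionice -c2 -n7 pg_dump -Fc production_db | throttle -m 1 > production_db.dump   # rate flag assumed; see throttle(1)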

The harder:
Change your backup procedure. For example, take a filesystem-level backup instead (the 'label' argument below is just an arbitrary name for the backup):

psql -c "SELECT pg_start_backup('label')"
rsync --checksum --archive /var/lib/pgsql /backups/pgsql
psql -c "SELECT pg_stop_backup()"

Even harder:
You can set up a replicated database (using log shipping, for example) on an additional cheap disk and back up the replica instead of the production database. Even if it falls behind by some transactions, it will eventually catch up. Just check that the replica is reasonably up to date before starting the backup.
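
A rough sketch of old-style WAL log shipping (the hostname and paths below are made up, and the details vary by PostgreSQL version):

On the primary, in postgresql.conf:

archive_mode = on                                                       # 8.3+; older versions only need archive_command
archive_command = 'rsync -a %p standby:/var/lib/pgsql/wal_archive/%f'   # hostname and path are placeholders

On the standby, in recovery.conf, something like contrib's pg_standby keeps it replaying WAL as it arrives:

restore_command = 'pg_standby /var/lib/pgsql/wal_archive %f %p %r'

Then take the nightly backup from the standby instead of the primary.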

Tometzky
So I tried throttle at 3Mbps and even 1Mbps, and amazingly they both still cause the issue. I think I'll need to try your second or third approach!
A: 

Your ps output shows multiple UPDATE statements in the "waiting" state, which still suggests locks to me (your lock-test query aside). I'm pretty sure you wouldn't see "waiting" in the ps output otherwise. Can you check whether this query shows anything while the issue is occurring:

SELECT * FROM pg_stat_activity WHERE waiting;

(You didn't say what version of PostgreSQL you are running so I'm not sure if this will work.)

If there's anything in there (that is, with waiting = TRUE), then it's a lock/transaction problem.
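
If it does, a join along these lines can usually point at the blocker. This is written against the pre-9.2 catalogs (procpid and current_query in pg_stat_activity; on 9.2+ they become pid and query), and it only covers the row-lock (transactionid) case, which is what an "UPDATE waiting" usually is:

SELECT blocked.pid                AS blocked_pid,
       blocked_act.current_query  AS blocked_query,
       blocking.pid               AS blocking_pid,
       blocking_act.current_query AS blocking_query
FROM pg_locks blocked
JOIN pg_locks blocking
  ON blocking.transactionid = blocked.transactionid   -- waiter and holder share the xid being waited on
 AND blocking.pid <> blocked.pid
 AND blocking.granted
JOIN pg_stat_activity blocked_act  ON blocked_act.procpid  = blocked.pid   -- procpid/current_query are pre-9.2 names
JOIN pg_stat_activity blocking_act ON blocking_act.procpid = blocking.pid
WHERE NOT blocked.granted;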

Matthew Wood
Thanks. I tried that, and yes there are always 1-2 locks but never a bunch of them, and none of the locks last for long.