ansaurus

Question

How do I do large non-blocking updates in PostgreSQL?

Answer 1

+1 A:

Postgres uses MVCC (multi-version concurrency control), so readers never block writers and writers never block readers. If you are the only writer, any number of concurrent readers can work on the table without waiting on a lock.

So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).

Martin v. Löwis 2009-07-11 09:17:21

The times that I have quoted above (5 hours, 35 minutes, ~3 minutes) are accurate for the scenarios I described above. I didn't state that there were no other writes happening in the database; just that I know that no one is going to be writing to the *column* while I'm doing the update (this column is not being used by the system at all, the rows are read/written though). In other words, I don't care if this work is processed in one huge transaction or in smaller pieces; what I'm concerned about is speed. And I can increase speed using the methods above, but they are cumbersome.

2009-07-11 09:30:18

It's still not clear whether the long run-time is due to the locking, or, say, vacuuming. Try acquiring a table lock before the update, locking out any other kind of operation. Then you should be able to complete this update without any interference.

Martin v. Löwis 2009-07-11 10:06:07

If I lock every other kind of operation, then the system risks being stalled until it's complete. Whereas the two solutions I have posted for reducing the time to 35min/3min do not prevent the system from functioning normally. What I'm looking for is a way to do so without having to write a script each time I want to do an update like this (which would save me 5 minutes each time I wanted to do one of these updates).

2009-07-11 19:34:41

Answer 2

+1 A:

I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.

A simple WHERE status IS NOT NULL might speed up things quite a bit (provided you have an index on status) – not knowing the actual use case, I'm assuming if this is run frequently, a great part of the 35 million rows might already have a null status.

However, you can loop inside a PL/pgSQL function with the LOOP statement. Here is a small example:

CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
BEGIN
    -- Walk the key space in batches of 1000; the FOR loop declares i itself.
    FOR i IN 0..(count/1000 + 1) LOOP
        -- Half-open range [i*1000, (i+1)*1000), so ids that are exact multiples of 1000 are not skipped.
        UPDATE orders SET status = null WHERE (order_id >= (i*1000) and order_id < ((i+1)*1000));
        RAISE NOTICE 'Count: % and i: %', count, i;
    END LOOP;
    RETURN 1;
END;
$$ LANGUAGE plpgsql;

It can then be run by doing something akin to:

SELECT nullstatus(35000000);

You might want to select the row count, but beware that the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.

Also, the RAISE NOTICE part is just there to keep track of how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.

mikl 2009-07-11 09:25:44

This will not help, as the function call runs in a single transaction, so the locking issue will still be there.

depesz 2009-07-11 10:11:29

Hmm, I had not considered that – still, I think this will be faster than UPDATE orders SET status = null;, since that would mean a full table scan.

mikl 2009-07-11 10:15:00

I understand the interest in the query running faster with an index, but that's not really my concern, as in some cases every value of the column is the same, rendering an index useless. I'm really concerned with the difference in time between running this query as one operation (5 hours) and breaking it up into pieces (3 minutes), and I want to do so within psql without having to write a script every time. I do know about indexes and how to possibly save even more time on these operations by using them.

2009-07-11 19:37:04

Oh, and to answer the first part of your question: it is indeed rare to have to update 35 million rows. This is mostly for cleanup; for example, we might decide, "Why does order_status = 'a' mean 'accepted' for the orders table and 'annulled' for the shipping table? We should make these consistent!" and so we need to update the code and do a mass update to the database to clean up the inconsistency. Of course this is an abstraction, as we don't actually have "orders" at all.

2009-07-11 19:42:50

Answer 3

+1 A:

First of all - are you sure that you need to update all rows?

Perhaps some of the rows already have a NULL status?

If so, then do:

UPDATE orders SET status = null WHERE status is not null;

As for partitioning the change: in pure SQL it is not possible.

The problem you have is that all the updates run in a single transaction.

One possible way to do it in "pure sql" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.

Usually just adding a proper WHERE clause solves the problem. If it doesn't, just partition it manually (writing a full script is too much; you can usually do it in one simple one-liner).

Example:

perl -e '
    for (my $i = 0; $i <= 35000000; $i += 1000) {
        printf "UPDATE orders SET status = null WHERE status is not null and order_id between %u and %u;\n",
        $i, $i+999
    }
'

(I split it across several lines here for readability; normally it's a single line.) The output of the above command can be fed directly to psql:

perl -e '...' | psql -U ... -d ...

or, first to file, and then to psql (in case you'd need the file later on):

perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
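As a hedged generalisation of the one-liner above (not part of the original answer), the same batching can be wrapped in a reusable shell function so that nothing has to be rewritten for each mass update. This is a minimal sketch: the function name `gen_batched_updates` and its argument order are hypothetical, it assumes an integer key column with roughly contiguous values, and it only prints SQL rather than running anything.

```shell
#!/bin/sh
# Hypothetical helper: gen_batched_updates TABLE SET_CLAUSE KEY_COLUMN MAX_ID [BATCH_SIZE] [FILTER]
# Prints one UPDATE statement per key range; nothing is executed here.
gen_batched_updates() {
    table=$1
    set_clause=$2
    key=$3
    max_id=$4
    batch=${5:-1000}
    # Optional extra predicate, e.g. "status IS NOT NULL", prepended to the range test.
    filter=${6:+"$6 AND "}
    lo=0
    while [ "$lo" -le "$max_id" ]; do
        hi=$((lo + batch - 1))
        # BETWEEN is inclusive on both ends, so consecutive ranges do not overlap.
        printf 'UPDATE %s SET %s WHERE %s%s BETWEEN %d AND %d;\n' \
            "$table" "$set_clause" "$filter" "$key" "$lo" "$hi"
        lo=$((lo + batch))
    done
}
```

With the function defined in the current shell, the output can be piped to psql exactly as before, for example `gen_batched_updates orders "status = NULL" order_id 35000000 1000 "status IS NOT NULL" | psql -U ... -d ...`. When psql reads statements this way with autocommit on (its default), each one commits on its own, which is what keeps any single lock short.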

depesz 2009-07-11 10:24:58

I appreciate your response, but it is basically identical to my #3 solution in my question; basically, this is what I already do. However, it takes 5 minutes to write out a script like this, whereas I'm trying to figure out a way to just do it within psql and therefore do it in 20 seconds or less (and also eliminate potential typos/bugs). That's the question I'm asking.

2009-07-11 19:40:13

And I thought I had answered it: it is not possible to do it in SQL (unless you use tricks like dblink). On the other hand, I wrote the one-liner I showed in around 30 seconds, so it doesn't look like too much time :) It's definitely closer to your 20-second target than a hypothetical 5 minutes of script writing.

depesz 2009-07-11 23:02:40

Thanks, but I misspoke when I said 'SQL'; in fact I'm asking how to do it in the psql console in PostgreSQL, using any tricks possible, including plpgsql. Writing the script as above is exactly what I'm doing now. It takes more than 30 seconds because you have to write a custom mini-script every time you do one of these updates, you have to run a query to find out how many rows you have, you have to make sure there are no typos, and so on. What I'd like to do is something like this at the psql prompt: select nonblocking_query('update orders set status=null'); That is what I am trying to accomplish.

2009-07-12 10:44:53

And this is what I have already answered twice: it's not possible unless you use dblink, but that is even more complicated than the one-liners you don't like.

depesz 2009-07-12 12:28:11

Answer 4

+1 A:

You should move this column out into a separate table like this:

create table order_status (
  order_id int not null references orders(order_id) primary key,
  status int not null
);

Then your operation of setting status=NULL will be instant:

truncate order_status;

Tometzky 2009-07-14 11:50:43

Answer 5

+1 A:

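Are you sure this is because of locking? I don't think so, and there are many other possible reasons. To find out, you can always try doing just the locking. Try this (clock_timestamp() is used because NOW() returns the transaction start time and would print the same value twice): BEGIN; SELECT clock_timestamp(); SELECT * FROM orders FOR UPDATE; SELECT clock_timestamp(); ROLLBACK;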

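To understand what's really happening you should run an EXPLAIN first (EXPLAIN UPDATE orders SET status...) and/or EXPLAIN ANALYZE. Note that EXPLAIN ANALYZE actually executes the UPDATE, so run it inside a transaction that you roll back if you don't want the changes kept. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.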

Also, tail the PostgreSQL log to see whether any performance-related problems occur.
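One way to make tailing the log more useful is to filter it for slow statements. The following is a minimal sketch, not part of the original answer: it assumes `log_min_duration_statement` is enabled so that the server writes lines containing `duration: N ms`, and the function name `slow_statements` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: slow_statements [THRESHOLD_MS]
# Reads log lines on stdin and prints those whose logged duration is at least
# THRESHOLD_MS milliseconds (default 1000).
slow_statements() {
    threshold_ms=${1:-1000}
    awk -v min="$threshold_ms" '
        match($0, /duration: [0-9.]+ ms/) {
            # "duration: " is 10 characters and " ms" is 3, so strip 13 in total.
            ms = substr($0, RSTART + 10, RLENGTH - 13)
            if (ms + 0 >= min) print
        }'
}
```

It would be used as `tail -f /path/to/postgresql.log | slow_statements 5000`, where the log path is a placeholder that depends on the installation.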

Martin Torhage 2009-07-14 21:07:54
