Hi there,

I have been running an UPDATE on a table containing 250 million rows and 3 indexes; this UPDATE uses another table containing 30 million rows. It has been running for about 36 hours now. I am wondering if there is a way to find out how close it is to being done: if it plans to take a million days to do its thing, I will kill it, but if it only needs another day or two, I will let it run. Here is the query:

UPDATE pagelinks SET pl_to = page_id
    FROM page
    WHERE 
        (pl_namespace, pl_title) = (page_namespace, page_title)
        AND
        page_is_redirect = 0
;

The EXPLAIN is not the issue here, and I only mention the big table's multiple indexes to somewhat justify how long the UPDATE is taking. But here is the EXPLAIN anyway:

Merge Join  (cost=127710692.21..135714045.43 rows=452882848 width=57)
  Merge Cond: (("outer".page_namespace = "inner".pl_namespace) AND ("outer"."?column4?" = "inner"."?column5?"))
  ->  Sort  (cost=3193335.39..3219544.38 rows=10483593 width=41)
        Sort Key: page.page_namespace, (page.page_title)::text
        ->  Seq Scan on page  (cost=0.00..439678.01 rows=10483593 width=41)
              Filter: (page_is_redirect = 0::numeric)
  ->  Sort  (cost=124517356.82..125285665.74 rows=307323566 width=46)
        Sort Key: pagelinks.pl_namespace, (pagelinks.pl_title)::text
        ->  Seq Scan on pagelinks  (cost=0.00..6169460.66 rows=307323566 width=46)

Now I also issued a parallel command to DROP one of pagelinks' indexes; of course it is waiting for the UPDATE to finish (but I felt like trying it anyway!). Hence, I cannot SELECT anything from pagelinks for fear of corrupting the data (unless you think it would be safe to kill the DROP INDEX backend process?).

So I am wondering if there is a table that keeps track of the amount of dead tuples or something similar, because it would be nice to know how fast or how far along the UPDATE is in completing its task.
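
For example, is there something along these lines? (pg_stat_user_tables is only my guess at the relevant statistics view, and I gather its counters may only be refreshed when a transaction ends, so it might not reflect the UPDATE while it is still running.)

-- my guess at a progress check: watch the dead-tuple count on the big table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum
    FROM pg_stat_user_tables
    WHERE relname = 'pagelinks';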

Thx (PostgreSQL is not as intelligent as I thought; it needs heuristics)

+1  A: 

Did you read the PostgreSQL documentation for "Using EXPLAIN", to interpret the output you're showing?

I'm not a regular PostgreSQL user, but I just read that doc and then compared it to the EXPLAIN output you're showing. Your UPDATE query is using no indexes, so it's forced to do full table scans and then sort both page and pagelinks. The sorts are no doubt large enough to need temporary disk files, which I think are created under your temp_tablespaces location.
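
Incidentally, the amount of memory each of those sorts may use before spilling to disk is governed by the work_mem setting; you can check what your server is configured with:

-- how much memory a single sort may use before it spills to temp files on disk
SHOW work_mem;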

Then I see the estimated database pages read. The top-level of that EXPLAIN output says (cost=127710692.21..135714045.43). The units here are in disk I/O accesses. So it's going to access the disk over 135 million times to do this UPDATE.

Note that even 10,000 RPM disks with 5 ms seek times can achieve at best about 200 I/O operations per second under optimal conditions. This would mean that your UPDATE needs roughly 188 hours (7.8 days) of disk I/O, even if you could sustain saturated disk I/O for that whole period (i.e. continuous reads/writes with no breaks). That's unrealistic in practice, and I'd expect the actual throughput to be lower by at least an order of magnitude, especially since you have no doubt been using this server for all sorts of other work in the meantime. So I'd guess you're only a fraction of the way through your UPDATE.
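
Spelling that estimate out:

    135,714,045 page fetches / 200 fetches per second
        ≈ 678,570 seconds
        ≈ 188 hours
        ≈ 7.8 days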

If it were me, I would have killed this query on the first day, and found another way of performing the UPDATE that made better use of indexes and didn't require on-disk sorting. You probably can't do it in a single SQL statement.
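
For example (only a sketch, not something I've tested against your schema: the pagelinks_new name is made up, you would need to carry along whatever other columns pagelinks has, and you would still have to rebuild the indexes and rename the table afterwards), building a replacement table avoids the giant in-place rewrite:

-- write the corrected rows once, sequentially, instead of updating 250M rows in place
CREATE TABLE pagelinks_new AS
    SELECT pl.pl_namespace,
           pl.pl_title,
           COALESCE(p.page_id, pl.pl_to) AS pl_to  -- keep the old value where no non-redirect page matches
           -- ...plus any other pagelinks columns
    FROM pagelinks pl
    LEFT JOIN page p
        ON  p.page_namespace   = pl.pl_namespace
        AND p.page_title       = pl.pl_title
        AND p.page_is_redirect = 0;
-- then CREATE INDEX on pagelinks_new, drop the old pagelinks, and rename pagelinks_new into place

Each row gets written once and the indexes get built once at the end, rather than maintaining all three indexes while 250 million new row versions are created, and the original pagelinks stays readable until you swap.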

As for your DROP INDEX, I would guess it's simply blocking, waiting for exclusive access to the table, and while it's in this state I think you can probably kill it.
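
If you want to confirm that before killing it, a query along these lines (a rough sketch against the standard pg_locks view) should show the ungranted lock request from the DROP INDEX:

-- sessions holding or waiting for locks on pagelinks; granted = false means still waiting
SELECT l.pid, l.mode, l.granted
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
    WHERE c.relname = 'pagelinks';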

Bill Karwin
Postgres' cost estimate is based on the number of memory pages it needs to touch. I think each page is 1/2 KB.
Barry Brown
The doc I read says: "Traditional practice is to measure the costs in units of disk page fetches"
Bill Karwin
I'm not saying you're wrong, only that my answer was based on what the doc said. :-)
Bill Karwin
I think we're both right, because on typical Unix systems, disk blocks and memory pages are the same size. But the OS will be able to fetch multiple blocks at once.
Barry Brown
I did actually kill it at the end of the second day! I have been waiting for the VACUUM FULL on this table to finish for the last 2 days now! Thanks for the knowledge concerning the EXPLAIN; I knew it was a measure of time, but now I know that it is in disk I/O.
Nicholas Leonard
I still wonder, does this estimate consider allocated RAM?
Nicholas Leonard
No. It's just an estimate of the number of pages that it needs to look at, whether or not they're in memory.
Barry Brown
Thx again. This is good info.
Nicholas Leonard
A: 

You need indexes or, as Bill pointed out, the query will have to do sequential scans on both tables.

CREATE INDEX page_ns_title_idx ON page (page_namespace, page_title);
CREATE INDEX pl_ns_title_idx ON pagelinks (pl_namespace, pl_title);
CREATE INDEX page_redir_idx ON page (page_is_redirect);
Barry Brown
I already had those indexes. I went into the postgresql.conf file and played with the query-optimizer and index-related variables in order to strongly encourage the planner to choose wisely! When my VACUUM of the table finishes, I will try it out without the redirect clause. Thx
Nicholas Leonard