I have been tasked with improving the performance of a slow-running process which updates some data in a Postgres 8.3 database (running on Solaris; updates are driven by Perl 5.8 scripts through SOAP). I have very little control over about 50% of the time consumed, so tuning my 50% is quite important.

There are usually about 4,500,000 rows in the table although I've seen it bloat out to about 7,000,000. The id that the update is querying on (not primary or unique) has just under 9000 distinct values and the spread of occurrences is weighted heavily towards 1 per id (median value is 20, max value 7000).

There is an index on this id, but with such sparse data I wonder if there's a better way of doing things. I'm also considering denormalising things a bit (the database is not super-normalised anyway) and pulling data out into a separate table (probably controlled/maintained by triggers) to help speed things up.
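
For illustration, this is roughly what I have in mind: a narrow side table holding only the rows that are still awaiting invoicing, kept in step by a trigger. All the names here are made up for the sketch, and I'm assuming tab1 has a primary key column (called pk below):

CREATE TABLE tab1_uninvoiced (
    tab1_pk     integer PRIMARY KEY,   -- assumed primary key of tab1
    id          varchar NOT NULL,
    releasetime date
);
CREATE INDEX tab1_uninvoiced_id_idx ON tab1_uninvoiced (id);

CREATE OR REPLACE FUNCTION tab1_uninvoiced_sync() RETURNS trigger AS $$
BEGIN
    -- Remove any existing entry for this row, then re-add it only if it
    -- still has a charge but no invoice and no client.
    IF TG_OP <> 'INSERT' THEN
        DELETE FROM tab1_uninvoiced WHERE tab1_pk = OLD.pk;
    END IF;
    IF TG_OP <> 'DELETE'
       AND NEW.charge IS NOT NULL AND NEW.invoice IS NULL AND NEW.client IS NULL THEN
        INSERT INTO tab1_uninvoiced (tab1_pk, id, releasetime)
        VALUES (NEW.pk, NEW.id, NEW.releasetime);
    END IF;
    RETURN NULL;   -- AFTER trigger, so the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tab1_uninvoiced_sync
    AFTER INSERT OR UPDATE OR DELETE ON tab1
    FOR EACH ROW EXECUTE PROCEDURE tab1_uninvoiced_sync();

The update would then find its candidate rows through tab1_uninvoiced (or a join against it) rather than scanning tab1's index for a skewed id.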

So far, I have made some pretty basic tweaks (not pinging the database every n seconds to see if it's alive, not setting session variables unnecessarily, etc.) and this is helping, but I really feel that there's something I'm missing with the data ...

Even if someone says that pulling relevant data out into a separate table is an excellent/terrible idea, that would be really helpful! Any other ideas (or further questions for clarification) gratefully received!

Query:

UPDATE tab1 SET client = 'abcd', invoice = 999 
    WHERE id = 'A1000062' and releasetime < '02-11-09'::DATE 
    AND charge IS NOT NULL AND invoice IS NULL AND client IS NULL;

I realise the 'is not null' is far from ideal. Id is indexed, as are invoice and client (btrees, so I understand Postgres will/should/can use an index there). It's a pretty trivial query ...

Query plan (explain with analyze):

Bitmap Heap Scan on tab1 (cost=17.42..1760.21 rows=133 width=670) (actual time=0.603..0.603 rows=0 loops=1)
  Recheck Cond: (((id)::text = 'A1000062'::text) AND (invoice IS NULL))
  Filter: ((charge IS NOT NULL) AND (client IS NULL) AND (releasetime < '2009-11-02'::date))
  ->  Bitmap Index Scan on cdr_snapshot_2007_09_12_snbs_invoice  (cost=0.00..17.39 rows=450 width=0) (actual time=0.089..0.089 rows=63 loops=1)
        Index Cond: (((snbs)::text = 'A1000062'::text) AND (invoice IS NULL))
Total runtime: 0.674 ms

Autovacuum is, I believe, enabled. There are no foreign key constraints, but thanks for the tip, as I didn't know that.

I am really liking the idea of increasing the statistics value - I'll be having a play around with that straight away.

A: 

You really need to get some query plans, and edit your question to include them. In addition to helping figure out better ways of doing things, they can also be used to easily measure the improvement.
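
For reference, one way to capture a plan for the update in the question (note that EXPLAIN ANALYZE actually executes the statement, so wrapping it in a transaction and rolling back avoids changing any data):

BEGIN;
EXPLAIN ANALYZE
    UPDATE tab1 SET client = 'abcd', invoice = 999
    WHERE id = 'A1000062' AND releasetime < '02-11-09'::DATE
    AND charge IS NOT NULL AND invoice IS NULL AND client IS NULL;
ROLLBACK;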


You can affect performance either by changing the SQL, or by adjusting the indexes and statistics that are used to determine the query plan.


One possibility is that you have foreign key constraints that do not have supporting indexes. PostgreSQL does not add them automatically when you create a foreign key constraint. If a row in the referenced table is deleted (or its referenced field is updated), the referencing table will need to be scanned entirely, either to cascade the change or to ensure that there are no rows referencing the deleted one.
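
As a sketch, with hypothetical table and column names (any such constraint would already exist in your schema; the point is the extra index on the referencing column):

-- The foreign key constraint alone does not create an index on child.parent_id.
ALTER TABLE child
    ADD CONSTRAINT child_parent_fk FOREIGN KEY (parent_id) REFERENCES parent (id);

-- A supporting index on the referencing column, so deletes/updates on parent
-- do not force a full scan of child.
CREATE INDEX child_parent_id_idx ON child (parent_id);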


If the distribution of your id field is quite skewed, increasing the statistics target on that column may help.

If the statistics target is set to 100, then the 100 most common ids (from a sample) will be recorded, along with their frequencies. Say that covers about 50% of your table, leaving roughly 2 to 3.5 million rows which PostgreSQL will assume fall evenly amongst your other 8900 ids, or about 250 to 400 rows each.

If the statistics target were increased to 1000 and the top 1000 ids covered 95% of your rows, PostgreSQL would assume that ids not in that list of the 1000 most common occur about 30 to 40 times each.

That change in estimates can affect the chosen query plan. If your queries more often select the less frequently occurring ids, PostgreSQL will be overestimating how many rows those ids match.

There is a performance cost to storing that many most-common values, so you really need supporting query plan analysis to determine whether you're getting a net gain.
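
As a concrete sketch, using the table and column from your query (1000 is just an illustrative target):

-- Raise the per-column statistics target, then re-analyze so the larger
-- most-common-values list is actually collected.
ALTER TABLE tab1 ALTER COLUMN id SET STATISTICS 1000;
ANALYZE tab1;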

Stephen Denne