I have a pretty large table (20 million records) which has a 3-column index and an array column. The array column is updated daily (by appending new values). There are also inserts, but not as many as updates.

The data in the array represents daily measurements corresponding to the three keys, something like this: [[date_id_1, my_value_for_date_1], [date_id_2, my_value_for_date_2]]. It is used to draw a graph of those daily values. Say I want to visualize the values for the key (a, b, c) over time; I do SELECT values FROM t WHERE a = my_a AND b = my_b AND c = my_c and then use the values array to draw the graph.
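
Roughly, the table looks like this (types are illustrative, only the names come from the description above; "values" is quoted because VALUES is a reserved word):

CREATE TABLE t (
    a        integer NOT NULL,
    b        integer NOT NULL,
    c        integer NOT NULL,
    "values" integer[]   -- two-dimensional: [[date_id, value], ...], appended to daily
);
CREATE INDEX t_a_b_c_idx ON t (a, b, c);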

Performance of the updates (which happen in bulk once a day) has worsened considerably over time.

Using PostgreSQL 8.3.8.

Can you give me any hints on where to look for a solution? It could be anything from tweaking some parameters in Postgres to moving to another database (I guess a non-relational database would be better suited for this particular table, but I don't have much experience with those).

+1  A: 

Not sure if arrays are the way to go here.

Why not store these in a separate table (one value plus keys per row)? Then your bulk update will be pure insert activity.
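
For example (table and column names are just placeholders):

CREATE TABLE daily_values (
    a       integer NOT NULL,
    b       integer NOT NULL,
    c       integer NOT NULL,
    date_id integer NOT NULL,
    value   integer,
    PRIMARY KEY (a, b, c, date_id)
);

-- the daily bulk load becomes insert-only
INSERT INTO daily_values (a, b, c, date_id, value)
SELECT a, b, c, 20100401, value
FROM todays_import;

-- graphing one key
SELECT date_id, value
FROM daily_values
WHERE a = 1 AND b = 2 AND c = 3
ORDER BY date_id;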

James Anderson
A: 

Well, a 3-column index is nothing to worry about; that doesn't necessarily make things much slower. But that array column might indeed be the problem. You say you are appending values to that array column daily. By appending, do you mean appending values to all 20 million records in the table, or just to some records?

The situation isn't completely clear to me, but I would suggest looking into ways of getting rid of that array column, for example by making it a separate table. This depends on your situation and might not be an option, though. It might be just me, but I always feel 'dirty' having such a column in one of my tables, and most of the time there is a better solution for the problem you are trying to solve with it. That being said, there are certainly situations in which such a column is valid, but at the moment I can't think of one, and certainly not in a table with 20 million records.

Cloud
I'm appending elements to some of the arrays, not all 20 million. I used to have a different table, but that made it HUGE and performance was much worse. I decided to denormalize and store the data in those arrays, which made a huge improvement to selects and didn't worsen updates at first (though they seem to get worse over time).
ionut bizau
Could you perhaps explain more about the nature of the data that's in the table, and in the array, and the relationship between them? There might still be a better solution. :)
Cloud
Added more details. See second paragraph. Not sure it really matters though.
ionut bizau
+5  A: 

I would take a look at the FILLFACTOR for the table. By default it's set to 100; you could lower it to 70 to start with. After that, you have to do a VACUUM FULL to rebuild the table.

-- leave 30% of each page free for updated row versions
ALTER TABLE tablename SET (FILLFACTOR = 70);
-- rebuild the table, then its indexes (both take exclusive locks)
VACUUM FULL tablename;
REINDEX TABLE tablename;

This gives UPDATE a chance to place the updated copy of a row on the same page as the original, which is more efficient than placing it on a different page. Now your database also has the option to do HOT updates.

Frank Heikens
As I see it, HOT is a way to avoid updating the index when doing an UPDATE, right? If so, why would my performance decrease *over time*? My indexes were being updated the same way 4 months ago as they are now, yet my performance has decreased considerably since then.
ionut bizau
The performance decrease might happen because new versions of a record (because of the update) are placed on a different page. When you have a lot of records, you will have a lot of pages as well, and putting the new version far away from the original will have an impact on query plans. Use EXPLAIN to see what happens. Also consider CLUSTER to store your records in the same order as the indexes store their information. You have to play with the fillfactor; updated records have to stay close to the original ones.
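For example (assuming the three-column index is named t_a_b_c_idx; this is the 8.3 CLUSTER syntax):

EXPLAIN ANALYZE SELECT "values" FROM t WHERE a = 1 AND b = 2 AND c = 3;

-- rewrite the table in (a, b, c) index order; takes an exclusive lock on t
CLUSTER t_a_b_c_idx ON t;
ANALYZE t;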
Frank Heikens
I see. So every time I make a query, it has to go through the dead rows as well before getting to the one I want? Is that what you mean (oversimplified)? That means the fillfactor should indeed help. Are HOT updates related to the fillfactor? How do I enable HOT updates once I change the fillfactor?
ionut bizau
Also, as your arrays get larger, the number of tuples you can have on a single page will decrease. It will start to use TOAST storage for large arrays and then you'll be back in the position of having your data in an external table again, but with an expensive process for adding an element.
araqnid
HOT updates are related to the fillfactor; there must be enough space for the updates within the same page. If you update all records, you need a fillfactor < 50. PostgreSQL will take care of HOT updates, if possible. pg_stat_user_tables has information about HOT updates, see http://www.postgresql.org/docs/8.4/interactive/monitoring-stats.html#MONITORING-STATS-SETUP
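To see whether your updates are actually HOT, something like this works (tablename is a placeholder):

SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'tablename';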
Frank Heikens
+1 for "fill factor" with a periodic "cluster". An upgrade to 8.4 may also help.
Tometzky
Thought about it some more. Say I have a fillfactor of 50; that allows each row to be updated at least once. Since many of my rows get updated, I just delay the slowdown but don't really fix it. Say I decrease the fillfactor to 10; I just delay it even more. Most rows will eventually be updated, multiple times, so I'll hit the same problem, just later. Isn't there a way to clean up those dead rows without a VACUUM FULL?
ionut bizau
Why would you delay the slowdown? Did you turn off autovacuum? Don't you use HOT updates? VACUUM FULL is something you don't need; you need autovacuum and maybe a bit of configuration. Start testing; that's the only way to see what is happening.
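For example, to check that autovacuum is on and has actually visited the table (tablename is a placeholder):

SHOW autovacuum;
SELECT relname, last_autovacuum, last_autoanalyze, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'tablename';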
Frank Heikens
I see. I think I understand now, but I would like to get your confirmation. :) So VACUUM actually removes the dead rows anyway. The problem appears because I do a bulk update, which doesn't give VACUUM the chance to do its job. Therefore FILLFACTOR (say 50) gives the bulk update the chance to put the updates on the same page, and then VACUUM comes in after the update and cleans up.
ionut bizau
I actually tried this: created a table with FILLFACTOR 50 and copied all the data from the old table. The next day the bulk UPDATE was super fast, 3h instead of 9+h. However, two days later it went back to 9h. Looking into pg_stat_user_tables I discovered that autovacuum hadn't done its job yet, which might explain the slowness. Now it has, and I did a VACUUM ANALYZE manually too. Waiting for the next day's data to see; it should be back to 3h again, and it should never increase if the problem is what we think it is. Thanks a lot!
ionut bizau
Ouch, it's all back to 9h. FILLFACTOR is 50, VACUUM ANALYZE was done before the bulk updates. Now I'm completely confused. :-/
ionut bizau
Solved the problem. Brought the whole batch of 4M UPDATEs I have daily down from 9h to 1h. \o/ I did something similar to CLUSTER, but manually. CLUSTER not only locks the table, but also uses a lot of resources. So I simply created another table with the same structure, inserted the records in the order I wanted (INSERT INTO ... SELECT FROM ... ORDER BY a, b, c), and made sure my UPDATEs happen in the same order as the physical order now on disk (a, b, c). Learned A LOT while solving this. Thank you all!
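In sketch form, the reorder looked something like this (names are placeholders):

CREATE TABLE t_sorted (LIKE t INCLUDING DEFAULTS);
INSERT INTO t_sorted SELECT * FROM t ORDER BY a, b, c;
-- then recreate the (a, b, c) index on t_sorted, swap the table names,
-- and feed the daily UPDATEs in the same (a, b, c) order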
ionut bizau
+2  A: 

The problem is in the updates. Change the schema from array-based to multiple rows per day, and the performance problem will go away.

You can add rollups to arrays, later on, with some kind of cronjob, but avoid updates.
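
For example, the rollup could append each new day to the array column instead of rebuilding it, given a per-day table daily_values(a, b, c, date_id, value) (all names and the date literal are placeholders):

UPDATE t
SET    "values" = "values" || ARRAY[[d.date_id, d.value]]
FROM   daily_values d
WHERE  d.a = t.a AND d.b = t.b AND d.c = t.c
AND    d.date_id = 20100401;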

depesz
"rollups to arrays" would be really slow if I would have to build the arrays from scratch (go through each day, for each combination of a, b, c).
ionut bizau