I'm creating a game with points for doing little things, so I have a schema as such:

create table points (
  id int,
  points int,
  reason varchar(10)
)

and getting the number of points a user has is trivial:

select sum(points) as total from points where id = ?

However, performance has become more and more of an issue as the points table expands. I want to do something like:

create table pointtotal (
  id int,
  totalpoints int
)

What is the best practice for keeping them in sync? Do I try to update pointtotal on every change? Do I run a daily script?

(Assume I have the right keys; they were left out for conciseness.)

Edit:

Here are some characteristics that I left out but should be helpful:

- Inserts/updates to Points are not all that frequent.
- There are a large number of entries, and a large number of requests.
- The keys were pretty trivial, as you can see.

+8  A: 

The best practice is to use a normalized database schema. Then the DBMS keeps it up to date, so you don't have to.

But I understand the tradeoff that makes a denormalized design attractive. In that case, the best practice is to update the total on every change. Investigate triggers. The advantage of this practice is that the total stays in sync with every change, so you never have to wonder whether it's out of date. If a change is committed, the updated total is committed with it.
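For instance, a minimal sketch of an insert trigger in MySQL (the trigger name is illustrative; it assumes pointtotal.id is a primary key, and UPDATEs and DELETEs on points would need analogous triggers):

-- upsert the new row's points into that id's running total
CREATE TRIGGER points_after_insert
AFTER INSERT ON points
FOR EACH ROW
  INSERT INTO pointtotal (id, totalpoints)
  VALUES (NEW.id, NEW.points)
  ON DUPLICATE KEY UPDATE totalpoints = totalpoints + NEW.points;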

However, this has some weaknesses with respect to concurrent changes. If you need to accommodate concurrent changes to the same totals, and you can tolerate the totals being "eventually consistent," then use periodic recalculation of the total, so you can be sure only one process at a time is changing the total.
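Such a recalculation can be a single statement (a sketch; REPLACE assumes pointtotal.id has a unique key, and this won't delete totals for ids removed from points):

-- recompute every total in one pass over the source table
REPLACE INTO pointtotal (id, totalpoints)
SELECT id, SUM(points)
FROM points
GROUP BY id;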

Another good practice is to cache aggregate totals outside the database, e.g. memcached or in application variables, so you don't have to hit the database every time you need to display the value.


The query "select sum(points) as total from points where id = ?" should not take 2 seconds, even if you have a huge number of rows and a lot of requests.

If you have a covering index defined over (id, points) then the query can produce the result without reading data from the table at all; it can calculate the total by reading values from the index itself. Use EXPLAIN to analyze your query and look for the "Using index" note in the Extra column.

CREATE TABLE Points (
  id     INT,
  points INT,
  reason VARCHAR(10),
  KEY    id (id, points)  -- covering index: the SUM can be answered from the index alone
);

EXPLAIN SELECT SUM(points) AS total FROM Points WHERE id = 1;

+----+-------------+--------+------+---------------+------+---------+-------+------+--------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref   | rows | Extra                    |
+----+-------------+--------+------+---------------+------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | points | ref  | id            | id   | 5       | const |    9 | Using where; Using index | 
+----+-------------+--------+------+---------------+------+---------+-------+------+--------------------------+
Bill Karwin
Ideally, but try convincing people to wait 2 seconds for a query!
Timmy
"select sum(points) as total from points where id = ?" should not take 2 seconds.
Bill Karwin
Triggers might be the way to go. I did not mention that inserts/updates are not that frequent.
Timmy
Also, it would be nice to avoid a filesort for "order by sum(points)" types of queries.
Timmy
+2  A: 

By all means keep the underlying table normalized. If you can deal with data potentially being one day old, run a script each night (you can schedule it) to do the roll-up and populate the new table. It's best to just re-create the table each night from the source table, to prevent any inconsistencies between the two.
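A nightly rebuild might be as simple as this sketch (run from cron or an equivalent scheduler; building the totals into a new table and renaming it into place would avoid serving a half-empty table mid-rebuild):

-- throw the rollup away and repopulate it from the source table
TRUNCATE TABLE pointtotal;

INSERT INTO pointtotal (id, totalpoints)
SELECT id, SUM(points)
FROM points
GROUP BY id;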

That said, given the size of your records, you must have either a very slow server or a very large number of records, because records that small, with an index on id, should sum very quickly. However, I am of the mindset that if you can improve user response time by even a few seconds, there is no reason not to use rollup tables, even if DB purists object.

EJB
Large # of records and large # of requests.
Timmy
+1  A: 

Keep an extra totalpoints column on the same table, and update it as a running total on every row creation/update.

If you need totalpoints as of a certain record, you can look up the value without recomputing it. For example, if you need the latest value of totalpoints, you can get it like this:

SELECT totalpoints FROM points ORDER BY id DESC LIMIT 1;
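A sketch of such an insert, with hypothetical values; it assumes point awards are non-negative, so MAX(totalpoints) is the user's most recent running total:

-- award 5 points to user 42, carrying the running total forward
INSERT INTO points (id, points, reason, totalpoints)
SELECT 42, 5, 'login', COALESCE(MAX(totalpoints), 0) + 5
FROM points
WHERE id = 42;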
Imran
+1  A: 

There is another approach: caching. Even if it's cached for only a few seconds or minutes, that is a win on a frequently accessed value. And it's possible to dissociate the cache fetch from the cache update. That way, a reasonably current value is always returned in constant time. The tricky bit is having the fetch spawn a new process to do the update.
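If an external cache isn't available, the same idea can be sketched inside the database with a timestamped cache table (names hypothetical); readers take whatever row is present, and a separate process refreshes rows that have gone stale:

CREATE TABLE pointcache (
  id           INT PRIMARY KEY,
  totalpoints  INT,
  refreshed_at DATETIME
);

-- reader: constant-time lookup, possibly slightly stale
SELECT totalpoints FROM pointcache WHERE id = ?;

-- refresher: run out-of-band for rows older than some threshold
REPLACE INTO pointcache (id, totalpoints, refreshed_at)
SELECT id, SUM(points), NOW() FROM points WHERE id = ? GROUP BY id;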

staticsan
+1  A: 

I'd suggest creating a layer that you use to access and modify the data. These DB access functions can encapsulate the maintenance of all the tables, keeping the redundant data in sync.
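This presumably means an application-level layer, but a database-side analogue is a stored procedure that every caller goes through (a sketch; the procedure name is made up, and it assumes pointtotal.id is a primary key):

DELIMITER //
CREATE PROCEDURE award_points(IN p_id INT, IN p_points INT, IN p_reason VARCHAR(10))
BEGIN
  -- single entry point that writes both tables, keeping them in sync
  INSERT INTO points (id, points, reason)
  VALUES (p_id, p_points, p_reason);

  INSERT INTO pointtotal (id, totalpoints)
  VALUES (p_id, p_points)
  ON DUPLICATE KEY UPDATE totalpoints = totalpoints + p_points;
END//
DELIMITER ;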

Patrick Cornelissen
+1  A: 

You could go either way in this case, because it's not very complicated.

I prefer, as a general rule, to allow the data to be temporarily inconsistent, by having just enough redundancy, and have a periodic process resolve the inconsistencies. However, there is no harm in having a trigger mechanism to encourage early execution of the periodic process.

I feel this way because relying on event-based notification-style code to keep things consistent can, in more complex cases, greatly complicate the code and make verification difficult.
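One sketch of that arrangement (the stale column is hypothetical, and the flag-setting trigger on points is left out): changes merely flag the affected total, and the periodic process recomputes only flagged rows, so the recomputation stays cheap and single-writer:

ALTER TABLE pointtotal ADD COLUMN stale TINYINT NOT NULL DEFAULT 1;

-- periodic job: recompute only the totals that changes have flagged
UPDATE pointtotal pt
JOIN (SELECT id, SUM(points) AS total FROM points GROUP BY id) p
  ON p.id = pt.id
SET pt.totalpoints = p.total, pt.stale = 0
WHERE pt.stale = 1;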

Mike Dunlavey
+1  A: 

You could also create another reporting schema and have it reload at fixed intervals via some process that does the calculations. This is not applicable to realtime information, but it is a very standard way of doing things.
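In MySQL, the fixed-interval reload could even live in the database itself via the event scheduler (a sketch using pointtotal as the reporting table; it assumes the scheduler is enabled and pointtotal.id has a unique key):

CREATE EVENT reload_pointtotal
ON SCHEDULE EVERY 1 HOUR
DO
  -- recompute all totals on a fixed schedule
  REPLACE INTO pointtotal (id, totalpoints)
  SELECT id, SUM(points) FROM points GROUP BY id;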

TheSoftwareJedi
+1  A: 

Keeping Denormalized Values Correct

Seun Osewa