I'm creating a game with points for doing little things, so I have a schema as such:

create table points (
  id int,
  points int,
  reason varchar(10)
)

and getting the number of points a user has is trivial:

select sum(points) as total from points where id = ?

However, performance has become more and more of an issue as the points table expands. I want to do something like:

create table pointtotal (
  id int,
  totalpoints int
)

What is the best practice for keeping them in sync? Do I try to update pointtotal on every change? Do I run a daily script?

(Assume I have the right keys; they were left out for conciseness.)

Edit:

Here are some characteristics that I left out but should be helpful:

- Inserts/updates to Points are not all that frequent.
- There are a large number of entries, and a large number of requests.
- The keys were pretty trivial, as you can see.

+8  A: 

The best practice is to use a normalized database schema. Then the DBMS keeps it up to date, so you don't have to.

But I understand the tradeoff that makes a denormalized design attractive. In that case, the best practice is to update the total on every change. Investigate triggers. The advantage of this practice is that the total stays in sync with every change, so you never have to wonder whether it's out of date. If a change is committed, the updated total is committed with it.
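For instance, a minimal sketch of an insert trigger in MySQL (the trigger name is illustrative; it assumes pointtotal.id is a primary key, and UPDATEs and DELETEs on points would need analogous triggers):

-- upsert the new row's points into that id's running total
CREATE TRIGGER points_after_insert
AFTER INSERT ON points
FOR EACH ROW
  INSERT INTO pointtotal (id, totalpoints)
  VALUES (NEW.id, NEW.points)
  ON DUPLICATE KEY UPDATE totalpoints = totalpoints + NEW.points;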

However, this has some weaknesses with respect to concurrent changes. If you need to accommodate concurrent changes to the same totals, and you can tolerate the totals being "eventually consistent," then use periodic recalculation of the total, so you can be sure only one process at a time is changing the total.
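Such a recalculation can be a single statement (a sketch; REPLACE assumes pointtotal.id has a unique key, and this won't delete totals for ids removed from points):

-- recompute every total in one pass over the source table
REPLACE INTO pointtotal (id, totalpoints)
SELECT id, SUM(points)
FROM points
GROUP BY id;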

Another good practice is to cache aggregate totals outside the database, e.g. memcached or in application variables, so you don't have to hit the database every time you need to display the value.


The query "select sum(points) as total from points where id = ?" should not take 2 seconds, even if you have a huge number of rows and a lot of requests.

If you have a covering index defined over (id, points) then the query can produce the result without reading data from the table at all; it can calculate the total by reading values from the index itself. Use EXPLAIN to analyze your query and look for the "Using index" note in the Extra column.

CREATE TABLE Points (
  id     INT,
  points INT,
  reason VARCHAR(10),
  KEY    id (id, points)  -- covering index: the SUM can be answered from the index alone
);

EXPLAIN SELECT SUM(points) AS total FROM Points WHERE id = 1;

+----+-------------+--------+------+---------------+------+---------+-------+------+--------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref   | rows | Extra                    |
+----+-------------+--------+------+---------------+------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | points | ref  | id            | id   | 5       | const |    9 | Using where; Using index | 
+----+-------------+--------+------+---------------+------+---------+-------+------+--------------------------+
Bill Karwin
Ideally, but try convincing people to wait 2 seconds for a query!
Timmy
"select sum(points) as total from points where id = ?" should not take 2 seconds.
Bill Karwin
Triggers might be the way to go. I did not mention that inserts/updates are not that frequent.
Timmy
Also, it would be nice to avoid a filesort for "order by sum(points)" types of queries.
Timmy
+2  A: 

By all means keep the underlying table normalized. If you can deal with data potentially being one day old, run a script each night (you can schedule it) to do the roll-up and populate the new table. It's best to just re-create the table each night from the source table, to prevent any inconsistencies between the two.
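A nightly rebuild might be as simple as this sketch (run from cron or an equivalent scheduler; building the totals into a new table and renaming it into place would avoid serving a half-empty table mid-rebuild):

-- throw the rollup away and repopulate it from the source table
TRUNCATE TABLE pointtotal;

INSERT INTO pointtotal (id, totalpoints)
SELECT id, SUM(points)
FROM points
GROUP BY id;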

That said, given the size of your records, you must have either a very slow server or a very large number of records, because records that small, with an index on id, should sum very quickly. However, I am of the mindset that if you can improve user response time by even a few seconds, there is no reason not to use rollup tables, even if DB purists object.

EJB
Large # of records and large # of requests.
Timmy
+1  A: 

Keep an extra totalpoints column on the same table, and update it as a running total on every row creation/update.

If you need totalpoints as of a certain record, you can look up the value without recomputing it. For example, if you need the latest value of totalpoints, you can get it like this:

SELECT totalpoints FROM points ORDER BY id DESC LIMIT 1;
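A sketch of such an insert, with hypothetical values; it assumes point awards are non-negative, so MAX(totalpoints) is the user's most recent running total:

-- award 5 points to user 42, carrying the running total forward
INSERT INTO points (id, points, reason, totalpoints)
SELECT 42, 5, 'login', COALESCE(MAX(totalpoints), 0) + 5
FROM points
WHERE id = 42;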
Imran
+1  A: 

There is another approach: caching. Even if it's cached for only a few seconds or minutes, that is a win on a frequently accessed value. And it's possible to dissociate the cache fetch from the cache update. That way, a reasonably current value is always returned in constant time. The tricky bit is having the fetch spawn a new process to do the update.
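If an external cache isn't available, the same idea can be sketched inside the database with a timestamped cache table (names hypothetical); readers take whatever row is present, and a separate process refreshes rows that have gone stale:

CREATE TABLE pointcache (
  id           INT PRIMARY KEY,
  totalpoints  INT,
  refreshed_at DATETIME
);

-- reader: constant-time lookup, possibly slightly stale
SELECT totalpoints FROM pointcache WHERE id = ?;

-- refresher: run out-of-band for rows older than some threshold
REPLACE INTO pointcache (id, totalpoints, refreshed_at)
SELECT id, SUM(points), NOW() FROM points WHERE id = ? GROUP BY id;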

staticsan
+1  A: 

I'd suggest creating a layer that you use to access and modify the data. These DB access functions can encapsulate the maintenance of all the tables, keeping the redundant data in sync.
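This presumably means an application-level layer, but a database-side analogue is a stored procedure that every caller goes through (a sketch; the procedure name is made up, and it assumes pointtotal.id is a primary key):

DELIMITER //
CREATE PROCEDURE award_points(IN p_id INT, IN p_points INT, IN p_reason VARCHAR(10))
BEGIN
  -- single entry point that writes both tables, keeping them in sync
  INSERT INTO points (id, points, reason)
  VALUES (p_id, p_points, p_reason);

  INSERT INTO pointtotal (id, totalpoints)
  VALUES (p_id, p_points)
  ON DUPLICATE KEY UPDATE totalpoints = totalpoints + p_points;
END//
DELIMITER ;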

Patrick Cornelissen
+1  A: 

You could go either way in this case, because it's not very complicated.

I prefer, as a general rule, to allow the data to be temporarily inconsistent, by having just enough redundancy, and have a periodic process resolve the inconsistencies. However, there is no harm in having a trigger mechanism to encourage early execution of the periodic process.

I feel this way because relying on event-based notification-style code to keep things consistent can, in more complex cases, greatly complicate the code and make verification difficult.
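One sketch of that arrangement (the stale column is hypothetical, and the flag-setting trigger on points is left out): changes merely flag the affected total, and the periodic process recomputes only flagged rows, so the recomputation stays cheap and single-writer:

ALTER TABLE pointtotal ADD COLUMN stale TINYINT NOT NULL DEFAULT 1;

-- periodic job: recompute only the totals that changes have flagged
UPDATE pointtotal pt
JOIN (SELECT id, SUM(points) AS total FROM points GROUP BY id) p
  ON p.id = pt.id
SET pt.totalpoints = p.total, pt.stale = 0
WHERE pt.stale = 1;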

Mike Dunlavey
+1  A: 

You could also create another reporting schema and have it reload at fixed intervals via some process that does the calculations. This is not applicable to realtime information, but it is a very standard way of doing things.
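In MySQL, the fixed-interval reload could even live in the database itself via the event scheduler (a sketch using pointtotal as the reporting table; it assumes the scheduler is enabled and pointtotal.id has a unique key):

CREATE EVENT reload_pointtotal
ON SCHEDULE EVERY 1 HOUR
DO
  -- recompute all totals on a fixed schedule
  REPLACE INTO pointtotal (id, totalpoints)
  SELECT id, SUM(points) FROM points GROUP BY id;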

TheSoftwareJedi
+1  A: 

Keeping Denormalized Values Correct

Seun Osewa