User Stats: "interative calculation" or bulk calculation + caching

views:

answers:

+3 Q:

User Stats: "interative calculation" or bulk calculation + caching

I have a project that calculates a number of 'statistics' about a users performance and then shows it to them. All these statistics ultimately come from a large table of 'interactions' that record the users interaction with the site. At the moment, all these statistics are calculated by looking at this data. We make extensive use of persistent caching to keep this things going fast.

We're considering moving to an 'iterative design' where statistic values are stored in the db, and upon logging every interaction, we update the values depending on what that interactions contribution to each score is, so we're essentially iteratively updating the values. (right now we just dirty the cache).

I see some trouble with the iterative design because it means we have these redundant, potentially out of sync information stored in our database, it makes adding new statistics difficult, and means more work on every interaction log. The benefits though are that it simplifies statistic lookups to single db hit!

Something in this iterative design raises alarms bells for me, but I can't deny the potential time saving benefits. Should I obey this gut feeling, or go ahead and do it?

+1 A:

When I'm doing database design, I try to avoid storing redundant data whenever possible. (This is, after all, the object of database normalization). Calculated columns and views are OK - these are managed and updated automatically by SQL server. Personally, I would lean towards other avenues before using the DB for caching (Is the SQL query really the part that needs performance improvement? Can I simplify things in the application by using a SQL view? etc.)

When you say manipulating the data, what operation are you performing that's so expensive? Do you mean insert/update/deletes? If your usage of the statistical data is write-intensive, you might consider removing indexes to speed up data changes.

RMorrisey 2009-09-21 23:53:04

No actually database manipulation, I just meant pulling in the 200 or so rows, massaging what I want out of it (sometimes its all unqiue apparences of a particular values, sometimes its the count of a part number of rows that match something, etc). None of that involves DB writes, those only when stuff is logged (at the moment!).

brbob 2009-09-22 00:04:12

In that case, you may be better off adding indexes, instead of removing them. =) Before you do that, though, does the performance problem come from having multiple database hits? Or the slowness of the query being run? Or just the number of results being returned? If the problem comes from having multiple DB hits or from the way you calculate/pull out the data needed, you might be able to fix it by moving the logic to a view/common table expression (SQL Server)/user-defined function/stored procedure. I am not saying that caching data in the DB is wrong, just that it's a last resort for me.

RMorrisey 2009-09-22 00:26:45

Would triggers help, in that you can then do the calculations whenever new data comes in, which leads to not having stale data.

This would only be helpful if the read is much higher than the write though. If I am doing 2 writes for each read then this would be a bad design.

Some more details about what you are doing would be helpful

James Black 2009-09-22 00:01:38

Triggers/Signals would be how I'd trigger the iterative updates!Writes only happen (at the moment) when the data is logged from the user interaction. Statistic calculation does not involve writes. So overall its a more read intensive application.

brbob 2009-09-22 00:06:03

Calculating on an insert-basis is certainly the way to go, IMHO.

To cover the problems of, say, not being able to generate new stats immediately (because you don't have the calculated data), you can either:

Run a bulk report on the new stat offline

Calculate it on the fly and merge with the cache

Depending on your caching model, the stats may be out of sync, or they may not be. If it's with triggers, it happens immediately (on insert into tblFoo update tblFooStats); but you can retrieve this as required.

I would think the only real risk is as-mentioned: not being able to add new stats/calculations immediately. If you cover this, life should be quite nice.

Noon Silk 2009-09-22 00:15:53

ansaurus

tags:

views:

answers:

User Stats: "interative calculation" or bulk calculation + caching

related questions