I have a normalized database and need to frequently produce web-based reports that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results precomputed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all updates so far.

All tables have autoincrement integer primary keys; I almost always add new rows, and I can arrange to clear the computed results if existing rows change.

I approached a similar problem where I needed a summary of a single table by arranging to iterate over each row in the table, keeping track of the iterator state and the highest primary key (i.e. "highwater") seen. That's fine for a single table, but for multiple tables I'd end up keeping one highwater value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels like a step backwards and would probably grow my database from about 5GB to about 20GB.

(I'm using sqlite3 at the moment, but MySQL is also an option).
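For concreteness, here is a minimal sketch of the single-table highwater approach described above. It assumes Python's sqlite3 module (the question mentions sqlite3 but not an application language) and an illustrative orders table with an order_value column; all names are made up.

import sqlite3

# Sketch only: table/column names are illustrative.
def refresh_summary(conn, state):
    # Fold any rows added since the last run into the running total,
    # remembering the highest primary key ("highwater") seen so far.
    rows = conn.execute(
        "SELECT id, order_value FROM orders WHERE id > ? ORDER BY id",
        (state["highwater"],),
    ).fetchall()
    for row_id, value in rows:
        state["total"] += value
        state["highwater"] = row_id
    return state

conn = sqlite3.connect("app.db")
state = {"highwater": 0, "total": 0}
state = refresh_summary(conn, state)

The complication with multi-table joins is exactly the one described: each joined table needs its own highwater value and its own merge step.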

+2  A: 

I see two approaches:

  1. You move the data to a separate, denormalized database with some precalculation applied, optimized for quick access and reporting (in effect a small data warehouse). This implies jobs (scripts, a separate application, etc.) that copy and transform the data from the source to the destination. Depending on whether the copying is full or incremental, how often it runs, and the complexity of the data model (both source and destination), it might take a while to implement and then to optimize the process. It has the advantage that it leaves your source database untouched. A minimal sketch of such an incremental copy job follows this list.

  2. You keep the current database, but you denormalize it. As you said, this might imply changes to the logic of the application (but you might find a way to minimize the impact on the logic using the database; you know the situation better than me :) ).
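A minimal sketch of the incremental copy job mentioned in option 1, assuming Python's sqlite3 module; the database filenames, table names (orders, report_orders, etl_state), and columns are made up for illustration.

import sqlite3

def copy_new_rows():
    # Sketch only: copies rows added since the last run from the source
    # database into a flattened reporting table, tracking a per-table
    # highwater mark in the destination.
    src = sqlite3.connect("source.db")
    dst = sqlite3.connect("reports.db")
    dst.execute("CREATE TABLE IF NOT EXISTS etl_state "
                "(table_name TEXT PRIMARY KEY, highwater INTEGER)")
    row = dst.execute(
        "SELECT highwater FROM etl_state WHERE table_name = 'orders'"
    ).fetchone()
    highwater = row[0] if row else 0

    new_rows = src.execute(
        "SELECT id, customer_id, order_value FROM orders WHERE id > ?",
        (highwater,),
    ).fetchall()
    for order_id, customer_id, value in new_rows:
        dst.execute(
            "INSERT INTO report_orders (order_id, customer_id, order_value) "
            "VALUES (?, ?, ?)",
            (order_id, customer_id, value),
        )
        highwater = max(highwater, order_id)

    dst.execute(
        "INSERT OR REPLACE INTO etl_state (table_name, highwater) "
        "VALUES ('orders', ?)",
        (highwater,),
    )
    dst.commit()

Run on a schedule (cron, a timer in the web app, etc.), this keeps the reporting copy close to current without touching the source schema.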

Cătălin Pitiș
#1 sounds standard, and appropriate. staging area --> normalized area --> reporting area
John Pirie
Yes, that's why I put it first :)
Cătălin Pitiș
+1  A: 

You can create triggers.

As soon as one of the calculated values changes, you can do one of the following:

  • Update the calculated field (Preferred)
  • Recalculate your summary table
  • Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary

Example:

CREATE TRIGGER update_summary_table AFTER UPDATE OF order_value ON orders
BEGIN
  UPDATE summary
    SET total_order_value = total_order_value
                          - old.order_value
                          + new.order_value;
  -- OR: do a complete recalculation here instead
  -- OR: store a flag that marks the summary as stale
END;

More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
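The third option (store a flag and recalculate lazily) can be sketched like this, again assuming Python's sqlite3 module and illustrative table names (orders, summary, summary_dirty), none of which come from the question:

import sqlite3

conn = sqlite3.connect("reports.db")

# Sketch only: assumes the orders and summary tables already exist.
conn.executescript("""
CREATE TABLE IF NOT EXISTS summary_dirty (flag INTEGER NOT NULL);
INSERT INTO summary_dirty (flag)
  SELECT 1 WHERE NOT EXISTS (SELECT 1 FROM summary_dirty);

-- Any change to order_value just marks the summary as stale.
CREATE TRIGGER IF NOT EXISTS mark_summary_stale
AFTER UPDATE OF order_value ON orders
BEGIN
  UPDATE summary_dirty SET flag = 1;
END;
""")

def get_total_order_value(conn):
    # Recalculate only when the flag says the cached value is stale.
    (dirty,) = conn.execute("SELECT flag FROM summary_dirty").fetchone()
    if dirty:
        conn.execute(
            "UPDATE summary SET total_order_value = "
            "(SELECT COALESCE(SUM(order_value), 0) FROM orders)"
        )
        conn.execute("UPDATE summary_dirty SET flag = 0")
        conn.commit()
    (total,) = conn.execute(
        "SELECT total_order_value FROM summary").fetchone()
    return total

This keeps the write path cheap (the trigger only flips a flag) and pays the recalculation cost at most once per read of stale data.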

DR
The trouble is that performing the query again from scratch on each update uses too much CPU time (and might cause too much contention on the database locks).
Dickon Reed
That is the reason why I added the first and third options: either update only the fields that need it, or write a marker that an update is needed and postpone the recalculation to a later time.
DR
+1  A: 

Can the reports be refreshed incrementally, or does each refresh require a full recalculation? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create tables to contain the report output (and a metadata table to define which report output versions are available), but most of the time this is overkill and you are better off just saving the query results to a file or other cache store.

If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
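For the full-recalculation case above, a minimal sketch of caching the query results off to a file, assuming Python's sqlite3 module; the report query, cache path, and refresh interval are all illustrative:

import json
import os
import sqlite3
import time

CACHE_PATH = "report_cache.json"
REFRESH_SECONDS = 300

def get_report(conn):
    # Serve a recent cached result set if one exists...
    if os.path.exists(CACHE_PATH):
        if time.time() - os.path.getmtime(CACHE_PATH) < REFRESH_SECONDS:
            with open(CACHE_PATH) as f:
                return json.load(f)
    # ...otherwise run the expensive report query and refresh the cache.
    rows = conn.execute(
        "SELECT customer_id, SUM(order_value) AS total "
        "FROM orders GROUP BY customer_id"
    ).fetchall()
    with open(CACHE_PATH, "w") as f:
        json.dump(rows, f)
    return rows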

A: 

In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
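For reference, a minimal sketch of that arrangement, assuming Python's sqlite3 module; the class, table, and column names are made up for illustration and the real summaries are whatever the reports need:

import sqlite3
from collections import defaultdict

class SingleWriter:
    # Sketch only: one process applies every update and keeps the running
    # summaries on its own heap rather than in the database.
    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.totals = defaultdict(float)

    def add_order(self, customer_id, order_value):
        self.conn.execute(
            "INSERT INTO orders (customer_id, order_value) VALUES (?, ?)",
            (customer_id, order_value),
        )
        self.conn.commit()
        self.totals[customer_id] += order_value  # keep the summary in step

    def report(self):
        return dict(self.totals)  # served straight from memory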

Dickon Reed
A: 

You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
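A quick way to check whether an index is covering in SQLite is EXPLAIN QUERY PLAN; a minimal sketch, again assuming Python's sqlite3 module and illustrative table, column, and index names:

import sqlite3

conn = sqlite3.connect("app.db")
# An index that includes every column the report query touches.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_orders_customer_value "
    "ON orders (customer_id, order_value)"
)
# If the index covers the query, the plan reports 'USING COVERING INDEX'.
for row in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, SUM(order_value) FROM orders GROUP BY customer_id"
):
    print(row)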

Then I think the trigger option discussed is also a very good strategy.

Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).

Cade Roux
Oh, I checked my indexing first :) There's only any point in looking at this kind of thing once you've already made the queries run as fast as you think they can.
Dickon Reed