I have run into the following situation several times, and was wondering what best practices say about this situation:

Rows are inserted into a table as users complete some action. For example, every time a user visits a specific portion of a website, a row is inserted indicating their IP address, username, and referring URL. Elsewhere, I want to show summary information about those actions. In our example, I'd want to allow administrators to log onto the website and see how many visits there are for a specific user.

The most natural way to do this (IMO) is to insert a row for every visit and, every time the administrator requests totals, count the rows in the appropriate table for that user. However, in situations like these there can be thousands upon thousands of rows per user, and if administrators frequently request the totals, constantly recomputing the counts could put quite a load on the database. So it seems like the right solution is to insert individual rows but simultaneously keep some kind of summary data with running totals as rows are inserted (to avoid recalculating those totals over and over).
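For concreteness, the "natural" approach above might look like this (a sketch using SQLite via Python's sqlite3 module; the table and column names are invented for the example):

```python
import sqlite3

# Hypothetical schema: one row per visit, totals computed on demand.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE visits (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    ip       TEXT NOT NULL,
    referrer TEXT
)
""")

# Simulate a burst of visits by one user.
conn.executemany(
    "INSERT INTO visits (username, ip, referrer) VALUES (?, ?, ?)",
    [("alice", "10.0.0.1", "http://example.com")] * 1000,
)

# The query the administrator's page would run on every request;
# it walks every matching row each time.
(total,) = conn.execute(
    "SELECT COUNT(*) FROM visits WHERE username = ?", ("alice",)
).fetchone()
print(total)  # 1000
```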

What are the best practices or most common database schema designs for this situation? You can ignore the specific example I made up; my real question is how to handle cases like this that deal with high-volume data and frequently requested totals or counts of that data.

Answer (+2):

Here are a couple of practices; the one you select will depend upon your specific situation:

  1. Trust your database engine. Many database engines will automatically cache the query plans (and results) of frequently used queries. Even if the underlying data has changed, the query plan itself will remain the same. The relevant parts of indexes will be kept in main memory, making it almost free to rerun a given query. The most you may need to do in this case is tune the database's parameters.
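As a small illustration of point 1 (using SQLite; the table and index names are made up), an index on the filtered column lets the engine answer the count from the index alone, without touching the table rows at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (id INTEGER PRIMARY KEY, username TEXT, ip TEXT)"
)
conn.execute("CREATE INDEX idx_visits_username ON visits (username)")

# Ask the engine how it would execute the administrator's count query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM visits WHERE username = ?",
    ("alice",),
).fetchall()
for row in plan:
    # e.g. "SEARCH visits USING COVERING INDEX idx_visits_username (username=?)"
    print(row[-1])
```

A "covering index" in the plan means the query is satisfied entirely from the index, which is the kind of near-free rerun the answer describes.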

  2. Denormalize your database. While third normal form (3NF) is still considered the appropriate database design, for performance reasons it can become necessary to add extra tables that hold summary values which would normally be calculated on demand via a SELECT ... GROUP BY ... query. These summary tables are frequently kept up to date by triggers, stored procedures, or background processes. See the Wikipedia article on denormalization for more.
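A minimal sketch of point 2, again using SQLite with invented names: a summary table maintained by a trigger, so the administrator's page reads one row instead of counting thousands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    ip       TEXT NOT NULL
);

-- Denormalized running totals, one row per user.
CREATE TABLE visit_totals (
    username TEXT PRIMARY KEY,
    total    INTEGER NOT NULL
);

-- Keep the running total in sync on every insert.
CREATE TRIGGER visits_after_insert AFTER INSERT ON visits
BEGIN
    INSERT OR IGNORE INTO visit_totals (username, total)
        VALUES (NEW.username, 0);
    UPDATE visit_totals SET total = total + 1
        WHERE username = NEW.username;
END;
""")

conn.executemany(
    "INSERT INTO visits (username, ip) VALUES (?, ?)",
    [("alice", "10.0.0.1")] * 3 + [("bob", "10.0.0.2")] * 2,
)

# A single-row lookup instead of a COUNT over the whole history.
print(conn.execute(
    "SELECT total FROM visit_totals WHERE username = 'alice'"
).fetchone()[0])  # 3
```

The trade-off is the usual one for denormalization: every insert now does a little extra work so that reads become trivial.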

  3. Data warehousing. With a data warehouse, the goal is to push copies of live data to secondary databases (warehouses) for querying and special reporting purposes. This is usually done by background processes using whatever replication techniques your database supports. These warehouses are frequently indexed far more heavily than the base application needs, with the intent of supporting large queries over massive amounts of historical data.
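A toy sketch of point 3, with two SQLite connections standing in for the live and warehouse databases (real setups would use the database's replication or a scheduled ETL job; all names here are invented):

```python
import sqlite3

live = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

live.execute(
    "CREATE TABLE visits (id INTEGER PRIMARY KEY, username TEXT, ip TEXT)"
)
live.executemany(
    "INSERT INTO visits (username, ip) VALUES (?, ?)",
    [("alice", "10.0.0.1")] * 4 + [("bob", "10.0.0.2")],
)

warehouse.execute(
    "CREATE TABLE visit_summary (username TEXT PRIMARY KEY, total INTEGER)"
)

def refresh_warehouse():
    """The kind of work a scheduled background job would do."""
    rows = live.execute(
        "SELECT username, COUNT(*) FROM visits GROUP BY username"
    ).fetchall()
    warehouse.execute("DELETE FROM visit_summary")  # full refresh, for simplicity
    warehouse.executemany("INSERT INTO visit_summary VALUES (?, ?)", rows)
    warehouse.commit()

refresh_warehouse()

# Reporting queries hit the warehouse, leaving the live database alone.
print(warehouse.execute(
    "SELECT total FROM visit_summary WHERE username = 'alice'"
).fetchone()[0])  # 4
```

The totals are only as fresh as the last refresh, which is the usual warehouse trade-off: reporting load moves off the live database at the cost of some staleness.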

Craig Trader