Hi,

I'm developing a statistics module for my website that will help me measure conversion rates and collect other interesting data.

The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone of my site (I avoid duplicate records with the help of cookies).

For example, I have the following zones:

  1. Website - a general zone used to count unique users, since I have stopped trusting Google Analytics lately.
  2. Category - self descriptive.
  3. Minisite - self descriptive.
  4. Product Image - counted whenever a user sees a product and the lead submission form.
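
For concreteness, a statistics table of this kind might look roughly like the sketch below. The question does not show the actual schema, so the table and column names are hypothetical, and SQL Server syntax is assumed since the site runs on ASP.NET.

    -- Hypothetical schema: one row per unique user visit to a zone,
    -- deduplicated with the help of a cookie-stored user id.
    CREATE TABLE StatisticsHit (
        HitId        BIGINT IDENTITY(1,1) NOT NULL,
        ZoneType     TINYINT          NOT NULL,  -- 1=Website, 2=Category, 3=Minisite, 4=Product Image
        ZoneId       INT              NULL,      -- the category/minisite/product the hit belongs to
        UserCookieId UNIQUEIDENTIFIER NOT NULL,  -- anonymous id stored in the visitor's cookie
        HitDate      DATETIME         NOT NULL DEFAULT GETDATE()
    );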

The problem is that after a month, my statistics table is packed with rows, and the ASP.NET pages I wrote to parse the data load really slowly.

I thought about writing a service that would parse the data in advance, but I can't see any way to do that without losing flexibility.

My questions:

  1. How do large-scale data-parsing applications like Google Analytics load the data so fast?
  2. What is the best way for me to do it?
  3. Or is my DB design wrong, and should I store the data in only one table?

Thanks to anyone who helps,

Eytan.

+3  A: 

The basic approach you're looking for is called aggregation.

You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page loads, you calculate them offline, either via a nightly batch process or incrementally as each log record is written.

A simple enhancement would be to store counts per user/session instead of storing every hit and counting them later. That would reduce your analytic processing requirements by a factor on the order of the number of hits per session. Of course, it would increase the processing cost when inserting log entries.
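
As a sketch of the nightly batch variant, assuming the hypothetical StatisticsHit table from the question above and SQL Server 2008 or later as the database (neither is confirmed in the question), a scheduled job could roll each day's raw hits up into a small summary table that the reporting pages query instead:

    -- Hypothetical daily summary table: a few rows per zone per day.
    CREATE TABLE DailyZoneStats (
        StatDate    DATE    NOT NULL,
        ZoneType    TINYINT NOT NULL,
        ZoneId      INT     NOT NULL,   -- 0 when the zone has no specific id (e.g. the whole website)
        UniqueUsers INT     NOT NULL
    );

    -- Nightly roll-up of yesterday's raw hits.
    INSERT INTO DailyZoneStats (StatDate, ZoneType, ZoneId, UniqueUsers)
    SELECT CAST(HitDate AS DATE),
           ZoneType,
           ISNULL(ZoneId, 0),
           COUNT(DISTINCT UserCookieId)
    FROM StatisticsHit
    WHERE HitDate >= CAST(DATEADD(DAY, -1, GETDATE()) AS DATE)
      AND HitDate <  CAST(GETDATE() AS DATE)
    GROUP BY CAST(HitDate AS DATE), ZoneType, ISNULL(ZoneId, 0);

The ASP.NET pages then read from DailyZoneStats, which stays small no matter how many raw hits accumulate.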

Another kind of aggregation is called online analytical processing (OLAP), which aggregates along only some dimensions of your data and lets the user aggregate over the remaining dimensions interactively while browsing. This trades off performance, storage, and flexibility.
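
If you want something OLAP-flavored without a dedicated OLAP server, GROUP BY CUBE over a summary table like the hypothetical DailyZoneStats above gives you sub-totals for every combination of the chosen dimensions (SQL Server 2008+ syntax; other databases spell this differently):

    -- Sub-totals per (year, month, zone type), per month, per zone type, and a grand total.
    -- Note: summing per-day unique-user counts overcounts visitors who return on
    -- several days, so treat the result as "daily uniques", not true monthly uniques.
    SELECT DATEPART(YEAR,  StatDate) AS StatYear,
           DATEPART(MONTH, StatDate) AS StatMonth,
           ZoneType,
           SUM(UniqueUsers)          AS DailyUniques
    FROM DailyZoneStats
    GROUP BY CUBE (DATEPART(YEAR, StatDate), DATEPART(MONTH, StatDate), ZoneType);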

David Schmitt
+1  A: 

Another trick to know is partitioning. Look up how it's done in the database of your choice, but basically the idea is that you tell your database to keep a table partitioned into several sub-tables, each with an identical definition, based on some value.

In your case, "range partitioning" is very useful: the partition is chosen based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month, depending on how you use your data and how much of it there is).

This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row and grows with your data, whereas a query against a date-partitioned table only touches the partitions for the days it needs).
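
As a sketch in SQL Server terms (syntax differs per database, and table partitioning may require a specific edition), monthly range partitioning of the hypothetical hit table could look like this:

    -- Boundary values split the data into monthly partitions (extend the list over time).
    CREATE PARTITION FUNCTION pfHitsByMonth (DATETIME)
        AS RANGE RIGHT FOR VALUES ('2009-01-01', '2009-02-01', '2009-03-01');

    CREATE PARTITION SCHEME psHitsByMonth
        AS PARTITION pfHitsByMonth ALL TO ([PRIMARY]);

    -- Same hypothetical hit table as before, now created on the partition scheme.
    CREATE TABLE StatisticsHit (
        HitId        BIGINT IDENTITY(1,1) NOT NULL,
        ZoneType     TINYINT          NOT NULL,
        ZoneId       INT              NULL,
        UserCookieId UNIQUEIDENTIFIER NOT NULL,
        HitDate      DATETIME         NOT NULL
    ) ON psHitsByMonth (HitDate);

    -- A date-bounded query now only touches the February partition.
    SELECT COUNT(DISTINCT UserCookieId)
    FROM StatisticsHit
    WHERE HitDate >= '2009-02-01' AND HitDate < '2009-03-01';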

This makes both the online queries (the ones issued when you hit your ASP.NET page) and the aggregation queries you use to pre-calculate statistics much faster.

SquareCog
+2  A: 

It seems like you could do well by using two databases: one for transactional data, handling all of the INSERT statements, and the other for reporting, handling all of your query requests.

You can index the snot out of the reporting database and/or denormalize the data so that fewer joins are needed in the queries. Periodically export data from the transactional database to the reporting database. Combined with the aggregation ideas mentioned earlier, this will improve the reporting response time considerably.
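
A minimal sketch of the reporting side, reusing the hypothetical DailyZoneStats layout from the aggregation answer: the table lives in the separate reporting database, carries denormalized zone names so reports need no joins, and is indexed for the date-range queries the pages actually issue (SQL Server syntax assumed):

    -- Run against the reporting database; all names are hypothetical.
    CREATE TABLE dbo.DailyZoneStats (
        StatDate    DATE          NOT NULL,
        ZoneType    TINYINT       NOT NULL,
        ZoneId      INT           NOT NULL,
        ZoneName    NVARCHAR(100) NOT NULL,  -- denormalized copy, avoids joins in reports
        UniqueUsers INT           NOT NULL
    );

    -- Index generously; this database only has to serve reads.
    CREATE INDEX IX_DailyZoneStats_DateZone
        ON dbo.DailyZoneStats (StatDate, ZoneType)
        INCLUDE (ZoneName, UniqueUsers);

The nightly export can be the same roll-up query shown in the aggregation answer, pointed at this table; a SQL Agent job works when both databases share an instance, and a linked server or SSIS package when they do not.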

AndrewDotHay