views:

42

answers:

2

I have a need to give my users time-based statistics about different kinds of visitors to their pages, such as how many visitors in the last week / last month / overall, and perhaps where those visitors are coming from. Ideally, the data would be rich enough to allow for expansion/customization/re-aggregation if possible.

My question is how best to organize this in the database. One option would be to create a new record for each visit, containing the visitor's id and the date.

I suspect that such an approach, while giving me the functionality, would be storage-heavy. Is there a better way to approach this?

A: 

I would go for "a new record for each visit". Programming will be complicated, otherwise.

You can get an estimate of how much storage you need depending on the traffic of your site. Do this, and you might decide not to worry about storage because storage is not usually an issue for modern day DBMSs.

If storage is really a problem, then you need to decide (and fix) what statistics you need, beforehand. Then, you can record each visit for just a single month. At the end of each month, you summarize that month and add this summary to your existing summary which should have been compiled from the previous months' statistics.

eakbas
+2  A: 

The accepted industry practice is to have a trace-level table and summary-level tables. Trace data means detail log of events. Even though storage is not an issue, the time it takes to count would be an issue. It would be much quicker to query visit counts by the hour than counting the records of the trace table.

Let us say your trace-level table is

Trace = [datetime, userId, domain], ipaddr;

As your process inserts into the trace table, it would also create/update records for the summary table.

Determine a finest granularity of time you wish to accumulate events. It could be minute, hour, day, etc.

You also need to determine if you wish to count by session or by page request. Would you count a repeated visit on a page as more than one visit? You have to determine all such denominators.

Define a counter:

class Counter{
  datetime start;
  int count;
}

Let us say that you have determined the granularity of by-hour is the most appropriate for the traffic on the site.

Cumulating by session over the whole site
At the start of the hour, the counter is reset and the start datetime is set to the current datetime. count is incremented per event that occurs. In your case, it would be a new session event.

When the next hour arrives, the Counter data is written to a new record in the table

ByHour = [start], count;

And again, the counter is reset and the start datetime is set to the current datetime.

What you have now is a table that cumulates incidents of new visits by the hour. So that now you could select the sum of count over all ByHour records over the ranges of all ByHour records that fall within a given day, week, month, year of a specified range.

In order not to affect the response of the web service, the web service should write those events into a bus or queue, which would be picked up by a separate data loading process.

Cumulating per page visited
However, you might want to cumulate by pages

The cumulator process would need to maintain an instance of the cumulator class per pageId:

class Counter{
  datetime start;
  int pageId;
  int count;
}

And, the count record would be:

ByHour = [start, pageId], count;
PageID = [pageId], url;

Cumulating per userId, pageId, sessionId, clientIp, anything-goes
Then, you further determine that you need to cumulate visits per user id. In that case, you might determine that it would be better cumulate by day rather than by the hour. If you cumulated it by hour, you might as well depend on the trace table because counting the scarcity of a user starting a new session or visiting per hour would be as good a glanularity as depending on the trace table.

The cumulator process would need to maintain an instance of the cumulator class per paramId:

class Counter{
  datetime start;
  int paramId;
  int count;
}

And, the count record would be:

ByDay = [start, paramId], count;
ParamID = [paramId], paramName;

Where paramName could be "pageId", "userId", "clientIp", etc.

To prevent losing data, you could update any Counter instances that had changes to the ByDay table every ten minutes, without reseting the counter instances. Reset the start and count only when the day crosses over to a new day.

class Counter{
  datetime start;
  boolean modified;
  int paramId;
  int count;
}

The "modified" field is not written to the database table. It is a flag to determine if a Counter instance needs to update the ByDay table. If an update occurs, the "modified" field is reset to "false".

Blessed Geek