views:

723

answers:

6

Hello all,

I have to collect statisctics by days, weeks, months and years of user activity for a site. I am the DB design stage and I wanted to do this stage properly since it will make my coding life easier.

What I have to do is just simply increment the values in the fields by 1 in the DB each time an activity happens. So then I can pull up the date by each day, each week, each month and year. How should my DB be structured? Apologies if this is a simple question for most. It would also be great if this structure could be extendable so that it can be broken down by other categories.

The bit am having trouble with is each month is made up of more days and these days change each calender year.

Thanks all for any help or direction.

Other info: Linux Machine, making use of PHP and MySQL

+8  A: 

Instead of updating counts per day, week etc. just INSERT a row into a table each time an activity happens like this:

insert into activities (activity_date, activity_info) 
values (CURRENT_TIMESTAMP, 'whatever');

Now your reports are very simple like:

select count(*) from activities
where activity_date between '2008-01-01' and '2008-01-07';

or

select YEARWEEK(`activity_date`) as theweek, count(*)
group by theweek
Tony Andrews
Wouldn't this table get very large if for example the activities being recorded happen every time a page is loaded of a website and there are many users that hold accounts with this website?
Abs
Yes. This is what you should start with as an initial design. Optimization can come later perhaps.
thomasrutter
Agree with Thomas. Also, look at the usefulness of this information - it can answer a lot more questions than you have currently thought of - e.g. average number of distinct users/period, average page views per user, ... You can always archive older data if space becomes a problem.
Tony Andrews
For performance you can leverage the MyISAM database engine for the table that will hold this data. Just be aware of the dangers: http://is.gd/lKZ0 and http://is.gd/lKZe
epochwolf
You might be better using the Archive storage engine.
GateKiller
+3  A: 

You may just add records into the table and SELECT them using aggregate functions.

If for some reason you need to keep aggregated statistics, you may use:

CREATE TABLE aggregates (type VARCHAR(20), part VARCHAR(10) NOT NULL PRIMARY KEY, activity INT)

INSERT INTO aggregates (type, part, activity)
VALUES ('year', SUBSTRING(SYSDATE(), 1, 4), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1

INSERT INTO aggregates (type, part, activity)
VALUES ('month', SUBSTRING(SYSDATE(), 1, 7), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1

INSERT INTO aggregates (type, part, activity)
VALUES ('day', SUBSTRING(SYSDATE(), 1, 10), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1

This will automatically update existing rows and insert non-existing when needed.

Quassnoi
+1  A: 
  1. table of events : id, activity id, datetime, userid.
  2. table of users : id, username etc
  3. table of activities : id, activity name, etc

Just enter a new row into events when an event happens. Then you can analyse the events but manipulating time, date, user, activity etc.

PaulBM
+1  A: 

To start with, you would probably imagine a single table, as this would be the most normalized form. The table would simply have an entry for each hit you receive, with each row containing the date/time of that hit.

Now, this way, in order to get statistics for each hour, day, week etc, the queries are simple but your database will have to do some pretty heavy query work. In particular, queries that do sums, counts or averages will need to fetch all the relevant rows.

You could get around this by precalculating the required counts in a second table, and making sure you sync that table to the first regularly. Problem is, you will be responsible for keeping that cache in sync yourself.

This would probably involve making a row for each hour. It will still be a lot quicker to do a query for a day, or a month, if you are only fetching a maximum of 24 rows per day.

Your other suggestion was to aggregate it from the start, never storing every single hit as a row. You would probably do that, as before, with a row for each hour. Every hit would increment the relevant hours' row by one. You would only have the data in one location, and it would already be pretty well summarised.

The reason I suggest by hour instead of by day, is that this still gives you the option to support multiple time zones. If your granularity is only to the day, you don't have that option.

thomasrutter
Ah, Thank you for this explanation. I think you realized I was a half pint when it comes to DB Design. Thanks! I will go for the simpler solution at first.
Abs
A: 

Tony Andrews' answer is the simplest, however a snowflake structure is sometimes used in data warehouse applications: a table that counts all the activities, another for activities per day, another for activities per month, and a third for activities per year. With this kind of structure, the activity between any two dates can be computed very efficiently. https://secure.wikimedia.org/wikipedia/en/wiki/Snowflake_schema

Maurice Perry
A: 

Use a star schema design. (or perhaps a snowflake design).

http://stackoverflow.com/questions/110032/star-schema-design

You will end up doing an insert into a fact table for each new activity. See Tony's suggestion.

You will need at least two dimension tables, one for users and one for time frames. There will probably be dimensions for activity type, and maybe even for location. It depends on what you want to do with the data.

Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)

Include columns in the Almanac for each reporting period you can think of. Week, Month, Quarter, Year, etc. You can even include reporting periods that relate to your company's own calendar.

Here's an article comparing ER and DM. I'm unusual in that I like both methods, choosing the appropriate method for the appropriate task.

http://www.dbmsmag.com/9510d05.html

Walter Mitty