I am tracking clicks over three time periods: the past day, past week and past month.

To do this, I have three tables:

  • An hourly table, with columns link_id, two other attributes, and hour_1 to hour_24, together with a computed column giving the sum

  • A weekday table, with columns link_id, two other attributes, and day_1 to day_7, together with a computed column giving the sum

  • A monthday table, as above, with columns day_1 to day_31

When a click comes in, I store its key attributes like href, description, etc, in other tables, and insert or update the row(s) corresponding to the link_id in each of the above tables.

Each link can have several entries in each of the above hourly/weekday/monthday tables, depending on the two other attributes (e.g. where the user is sitting).

So if a user is Type A and sitting in X, three rows are created or incremented in each of the above tables: the first row records all clicks on that link over the time period, the second records all clicks by "Type A people", and the third "all clicks by people in X".

I have designed it this way as I didn't want to have to move data around each hour/day/week/month. I just maintain pointers for "current hour" (1-24), "current day" (1-31) and "current weekday" (1-7), and write to the corresponding cells in the tables. When we enter a new period (e.g. "3pm-4pm"), I can just blank out that current column (e.g. hour_15), then start incrementing it for links as they come in. Every so often I can delete old rows which have fallen down to "all zero".

This way I shouldn't ever have to move around column data, which would likely be very expensive for what will potentially be tens of thousands of rows.
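
For concreteness, a rough sketch of the weekday table and the per-click update (simplified; user_type and location are placeholder names for the two other attributes; the hourly and monthday tables follow the same pattern with hour_1 to hour_24 and day_1 to day_31):

CREATE TABLE weekday_clicks (
    link_id   INT         NOT NULL,
    user_type VARCHAR(20) NOT NULL,  -- placeholder for one attribute
    location  VARCHAR(20) NOT NULL,  -- placeholder for the other
    day_1 INT NOT NULL DEFAULT 0, day_2 INT NOT NULL DEFAULT 0,
    day_3 INT NOT NULL DEFAULT 0, day_4 INT NOT NULL DEFAULT 0,
    day_5 INT NOT NULL DEFAULT 0, day_6 INT NOT NULL DEFAULT 0,
    day_7 INT NOT NULL DEFAULT 0,
    week_total AS (day_1 + day_2 + day_3 + day_4 + day_5 + day_6 + day_7)
)

-- If the "current weekday" pointer is 3, each click does this (values are examples;
-- the column name is chosen by the application from the pointer):
UPDATE weekday_clicks SET day_3 = day_3 + 1
WHERE link_id = 42 AND user_type = 'TypeA' AND location = 'X'

-- When a new weekday starts, blank that column out before reusing it:
UPDATE weekday_clicks SET day_3 = 0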

I will only be SELECTing either the current day/weekday/hour rows (prior to inserting/updating) or the TOP 20 values from the computed columns based on the attributes (and will likely cache these results for an hour or so).

After the tables populate, UPDATES will far exceed INSERTs as there aren't that many unique hrefs.

Three questions:

  • Is it OK to combine the three big tables into one big table of monthdays/weekdays/hours? This would give a table with 64 columns, which I worry may be overkill. On the other hand, keeping them separate as they are now triples the number of INSERT/UPDATE statements needed. I don't know enough about SQL Server to know which is best.

  • Is this approach sensible? Most data sets I've worked with of course have a separate row per item and you would then sort by date -- but when tracking clicks from thousands of users this would give me many hundreds of thousands of rows, which I would have to cull very often, and ordering and summing them would be hideous. Once the tracker is proven, I have plans to roll the click listener out over hundreds of pages, so it needs to scale.

  • In terms of design, clearly there is some redundancy in having both weekdays and monthdays. However, this was the only way I could think of to maintain a pointer to a column and quickly update it, and use a computed column. If I eliminated the weekdays table, I would need to get an additional computed column on the "monthdays" that summed the previous 7 days -- (e.g. if today is the 21st, then sum day_14, day_15, day_16... day_20). The calculation would have to update every day, which I imagine would be expensive. Hence the additional "weekday" table for a simple static calculation. I value simple and fast calculations more highly than small data storage.

Thanks in advance!

+2  A: 

Denormalization as you have done in your database can be a good solution for some problems. In your case, however, I would not choose this design, mainly because you lose information that you might need later (say you want to report on half-hour intervals in the future). Looking at your description, you could manage with only two tables: links (hrefs and descriptions) and clicks on those links (containing the date and time of the click and perhaps some other data). The drawback, of course, is that you have to store hundreds of thousands of records, and querying that amount of data can take a long time. If that becomes a problem, you might consider storing aggregate data for these two tables in separate tables and updating those aggregates on a regular basis.
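
A rough sketch of such a layout (table and column names here are only illustrative):

CREATE TABLE links (
    link_id     INT IDENTITY(1,1) PRIMARY KEY,
    href        NVARCHAR(400) NOT NULL,
    description NVARCHAR(200) NULL
)

CREATE TABLE clicks (
    link_id    INT      NOT NULL REFERENCES links (link_id),
    click_time DATETIME NOT NULL DEFAULT GETDATE(),
    user_type  VARCHAR(20) NULL,   -- the "other data": attributes from the question
    location   VARCHAR(20) NULL
)

-- Reporting is then just a filter on click_time, e.g. top 20 links over the past week:
SELECT TOP 20 link_id, COUNT(*) AS clicks
FROM clicks
WHERE click_time >= DATEADD(day, -7, GETDATE())
GROUP BY link_id
ORDER BY COUNT(*) DESC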

Geert Immerzeel
+4  A: 

Any time you see columns with numbers in their names, such as column_1, column_2, column_3..., your 'horrible database design' flag should go up. (FYI, you are breaking 1NF here; specifically, you are repeating groups across columns.)

Now, it is possible that such an implementation is acceptable (or even necessary) in production, but conceptually it is definitely wrong.

As Geert says, conceptually two tables will suffice. If performance is an issue you could denormalize the data for weekly/monthly stats, but even then I would not model it as above; I would keep something like

CREATE TABLE base_stats ( link_id INT, click_time DATETIME )
CREATE TABLE daily_stats ( link_id INT, period DATETIME, clicks INT )

You can always aggregate with

SELECT link_id, COUNT(*) AS clicks, CAST(click_time AS DATE) AS day
FROM base_stats
GROUP BY link_id, CAST(click_time AS DATE)

which can be run periodically to fill the daily_stats. If you want to keep it up to date you can implement it in triggers (or if you really must, do it on the application side). You can also denormalize the data on different levels if necessary (by creating more aggregate tables, or by introducing another column in the aggregated data table), but that might be premature optimization.
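
If triggers are the route taken, a rough sketch could look like this (my assumptions: SQL Server 2008+ for the DATE type, a unique key on daily_stats (link_id, period), and that heavy concurrency would be handled separately, e.g. with appropriate locking or MERGE):

CREATE TRIGGER trg_base_stats_daily ON base_stats
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON

    -- bump counters for link/day combinations that already have a row
    UPDATE d
    SET d.clicks = d.clicks + i.new_clicks
    FROM daily_stats d
    JOIN (SELECT link_id, CAST(click_time AS DATE) AS period, COUNT(*) AS new_clicks
          FROM inserted
          GROUP BY link_id, CAST(click_time AS DATE)) i
      ON i.link_id = d.link_id AND i.period = d.period

    -- add rows for combinations seen for the first time
    INSERT INTO daily_stats (link_id, period, clicks)
    SELECT i.link_id, CAST(i.click_time AS DATE), COUNT(*)
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM daily_stats d
                      WHERE d.link_id = i.link_id
                        AND d.period = CAST(i.click_time AS DATE))
    GROUP BY i.link_id, CAST(i.click_time AS DATE)
END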

The above design is much cleaner for future ad-hoc analysis (which, with stats, will happen). For other benefits, see Wikipedia on repeating groups.

EDIT: Even though the solution with the two tables base_stats and daily_stats has been accepted, with the following strategy:

  • insert each click in base_stats
  • periodically aggregate the data from base_stats into daily_stats and purge the full detail

it might not be the optimal solution. Based on the discussion and the clarified requirements, it seems that the base_stats table is not necessary. The following approach should also be investigated:

CREATE TABLE period_stats ( link_id INT, period DATETIME, ...)

Updates are easy with

UPDATE period_stats 
SET clicks = clicks + 1 
WHERE period = @dateTime AND link_id = @url AND ...

The cost of updating this table, properly indexed, is comparable to the cost of inserting rows into base_stats, and it is also easy to use for analysis:

SELECT link_id, SUM(clicks)
FROM period_stats
WHERE period BETWEEN @dateTime1 AND @dateTime2
GROUP BY ...
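
For example (index name and key choice are my assumptions), a clustered index leading on period supports both the targeted UPDATE and the date-range aggregation; the remaining attribute columns behind the "..." would typically be appended to the key:

CREATE CLUSTERED INDEX IX_period_stats_period_link
ON period_stats (period, link_id)
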
Unreason
Thanks -- yes, this is what I started with. Flags very much raised, hence this post. I'm unlikely to want to do too much analysis on the data, but the more open it is the better. The key thing is performance. This is a first step towards a dynamic navigational model on our Intranet -- users will be able to browse by "Most popular with your colleagues", "most popular this week in your office", etc., rather than relying on a strict hierarchy -- a big change....
Jhong
... The click listener attaches itself to most links on a page, and with 10k users, there will be a huge stream of data coming in. If this slows the application server to a crawl, I will have trouble convincing people that this is the future. On the other hand, once this is proven on our homepage, I'll be pushing this as the way forward for most every page on the Intranet -- thousands of pages, including a Sharepoint instance, and many more users. I'm not interested in accurate analytics -- purely aggregates at this stage....
Jhong
... I prefer the normalized model, but how expensive is count(*) with GROUP BY and ORDER BY on a table with half a million rows? Nothing substitutes for testing, but I'm keen to get the design in the right ballpark.
Jhong
One table is enough!
iDevlop
@Patrick: Conceptually you are right: one table is enough (logical design) and it is more flexible/cleaner that way. However, your exclamation is not justified: in physical design you are free to denormalize where it is warranted. On 500k rows, and with an application that will do mostly reads, an implementation that uses triggers to maintain aggregates will certainly put less load on the server (while the triggers keep data integrity strong). The question is what ratio of reads vs. writes justifies the added complexity (and it is not really that complex).
Unreason
@Jhong, with an index covering all of your GROUP BY columns (order-dependent), the `count(*)` will not be expensive: it can be calculated from the index alone (in this sense an index *is* like an aggregate table, allowing fast count, max, min, etc.). But, as you yourself pointed out, nothing beats testing, and all of the above can be tested easily.
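
For example, an index along these lines (name assumed) lets the daily COUNT(*) aggregation shown in the answer be answered from the index alone:

CREATE INDEX IX_base_stats_link_time ON base_stats (link_id, click_time)
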
Unreason
Thanks. I've written an alternative DB layer using the "1 click per row" paradigm -- and it certainly makes my head hurt a lot less. There are fewer INSERTs per click now: one for the click table, and possibly one each for the links, depts and offices tables. Given that there will be a lot more writes to the DB than reads, this probably makes sense. The number of rows still worries me, though. Say I get 5 clicks/second (and the peak will likely be higher)... that's 3.6 million rows for a month of working days. I'm worried that this is too heavy for what should be a simple aggregating backend.
Jhong
@Jhong, OK, but I would not keep all of that detail if it is not necessary. Periodically you can aggregate the data into a table similar to daily_stats (not really well named, since the period can be whatever you choose) and purge it from the one-click-per-row table.
Unreason
Yes, that makes sense. I'll pursue this route. Thanks.
Jhong
+2  A: 

That design is really bad. Unreason's proposal is better.
If you want to make it nice and easy, you could just as well have a single table with these fields:

   timeSlice  
   clickCount  
   location
   userType 

with timeSlice holding the date and time rounded down to the hour. Everything else can be deduced from that, and you would have only
24 * 365 * locations# * types#
records per year.

Depending on the configuration and what is feasible, with this table design you could even accumulate values in memory and only update the table once every 10 seconds, or any interval up to 1 hour, depending on the acceptable risk.
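
A possible sketch of that table and the increment, as I picture it (names, sizes and the DECLAREd example values are only illustrative; in practice they would be application parameters):

CREATE TABLE click_stats (
    timeSlice  DATETIME    NOT NULL,   -- date and time rounded down to the hour
    clickCount INT         NOT NULL DEFAULT 0,
    location   VARCHAR(20) NOT NULL,
    userType   VARCHAR(20) NOT NULL
)

DECLARE @slice DATETIME, @location VARCHAR(20), @userType VARCHAR(20)
SET @location = 'X'
SET @userType = 'TypeA'
SET @slice = DATEADD(hour, DATEDIFF(hour, 0, GETDATE()), 0)   -- round down to the hour

-- add 1 per click, or the accumulated in-memory count per batch
UPDATE click_stats SET clickCount = clickCount + 1
WHERE timeSlice = @slice AND location = @location AND userType = @userType

IF @@ROWCOUNT = 0
    INSERT INTO click_stats (timeSlice, clickCount, location, userType)
    VALUES (@slice, 1, @location, @userType)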

iDevlop
Yes; however, I was worried about the half a million clicks that would come streaming into the click table. That said, I guess inserting the click blindly is going to be a lot less expensive than my current method, which requires selecting current counts and incrementing columns. Selecting and counting the huge result set will be a killer, but I can cache that.
Jhong
Clicks streaming: at least (1) you'll easily locate the record, (2) you might, if technically feasible, consider waiting for 10 clicks before incrementing your table, and (3) clicks will stream anyway: your architecture just multiplies them by 3.
iDevlop
There will be 24 * 365 records per year, times the number of links, times the number of users (colleagues, office), times other dimensions. As the links are already estimated at 10,000, you can easily run into millions of records. Still, with proper indexes this could be OK. Testing is the way to go.
Unreason
@Unreason: OK, fair point on the number of records; I updated the table to better reflect the question, which I had not fully read. I still think, however, that one table is the way to go here, whatever the size or speed involved. If I had to add another table, it would only be for archiving records that are more than X days/months old.
iDevlop
@Patrick, well, whether one or two tables are needed depends on further requirements; when I wrote my initial answer it was not clear whether there were any. Yes, of course: the table that would have a row for each click is redundant if it is not necessary (sic). Also, I was aiming to show variations of the design. With a base_stats table (in memory) it is possible, for example, to do very simple inserts and run updates on a persistent table every x minutes, without needing a separate application layer or cache.
Unreason