I am tracking clicks over three time periods: the past day, past week and past month.

To do this, I have three tables:

  • An hourly table, with columns link_id, two other attributes, and hour_1 to hour_24, together with a computed column giving the sum

  • A weekday table, with columns link_id, two other attributes, and day_1 to day_7, together with a computed column giving the sum

  • A monthday table, as above, with columns day_1 to day_31

When a click comes in, I store its key attributes like href, description, etc, in other tables, and insert or update the row(s) corresponding to the link_id in each of the above tables.

Each link can have several entries in each of the above hourly/weekday/monthday tables, depending on the two other attributes (e.g. where the user is sitting).

So if a user is Type A and sitting in X, three rows are created or incremented in each of the above tables: the first row records all clicks on that link over the time period, the second records all clicks by "Type A people", and the third "all clicks by people in X".

I have designed it this way as I didn't want to have to move data around each hour/day/week/month. I just maintain pointers for "current hour" (1-24), "current day" (1-31) and "current weekday" (1-7), and write to the corresponding cells in the tables. When we enter a new period (e.g. "3pm-4pm"), I can just blank out that current column (e.g. hour_15), then start incrementing it for links as they come in. Every so often I can delete old rows which have fallen down to "all zero".

This way I shouldn't ever have to move around column data, which would likely be very expensive for what will potentially be tens of thousands of rows.
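
For concreteness, a rough sketch of the weekday table and the per-click update (simplified; user_type and location are placeholder names for the two other attributes; the hourly and monthday tables follow the same pattern with hour_1 to hour_24 and day_1 to day_31):

CREATE TABLE weekday_clicks (
    link_id   INT         NOT NULL,
    user_type VARCHAR(20) NOT NULL,  -- placeholder for one attribute
    location  VARCHAR(20) NOT NULL,  -- placeholder for the other
    day_1 INT NOT NULL DEFAULT 0, day_2 INT NOT NULL DEFAULT 0,
    day_3 INT NOT NULL DEFAULT 0, day_4 INT NOT NULL DEFAULT 0,
    day_5 INT NOT NULL DEFAULT 0, day_6 INT NOT NULL DEFAULT 0,
    day_7 INT NOT NULL DEFAULT 0,
    week_total AS (day_1 + day_2 + day_3 + day_4 + day_5 + day_6 + day_7)
)

-- If the "current weekday" pointer is 3, each click does this (values are examples;
-- the column name is chosen by the application from the pointer):
UPDATE weekday_clicks SET day_3 = day_3 + 1
WHERE link_id = 42 AND user_type = 'TypeA' AND location = 'X'

-- When a new weekday starts, blank that column out before reusing it:
UPDATE weekday_clicks SET day_3 = 0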

I will only be SELECTing either the current day/weekday/hour rows (prior to inserting/updating) or the TOP 20 values from the computed columns based on the attributes (and will likely cache these results for an hour or so).

After the tables populate, UPDATES will far exceed INSERTs as there aren't that many unique hrefs.

Three questions:

  • Is it OK to combine the three big tables into one big table of monthdays/weekdays/hours? This would give a table with 64 columns, which I worry may be overkill. On the other hand, keeping them separate as they are now triples the number of INSERT/UPDATE statements needed. I don't know enough about SQL Server to know which is best.

  • Is this approach sensible? Most data sets I've worked with of course have a separate row per item and you would then sort by date -- but when tracking clicks from thousands of users this would give me many hundreds of thousands of rows, which I would have to cull very often, and ordering and summing them would be hideous. Once the tracker is proven, I have plans to roll the click listener out over hundreds of pages, so it needs to scale.

  • In terms of design, clearly there is some redundancy in having both weekdays and monthdays. However, this was the only way I could think of to maintain a pointer to a column and quickly update it, and use a computed column. If I eliminated the weekdays table, I would need to get an additional computed column on the "monthdays" that summed the previous 7 days -- (e.g. if today is the 21st, then sum day_14, day_15, day_16... day_20). The calculation would have to update every day, which I imagine would be expensive. Hence the additional "weekday" table for a simple static calculation. I value simple and fast calculations more highly than small data storage.

Thanks in advance!

+2  A: 

Denormalization as you have done in your database can be a good solution for some problems. In your case, however, I would not choose this design, mainly because you lose information that you might need later (say you want to report on half-hour intervals in the future). Looking at your description, you could manage with only two tables: links (hrefs and descriptions) and clicks on those links (containing the date and time of the click and perhaps some other data). The drawback, of course, is that you have to store hundreds of thousands of records, and querying that amount of data can take a long time. If that becomes a problem, you might consider storing aggregate data for these two tables in separate tables and updating those aggregates on a regular basis.
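
A rough sketch of such a layout (table and column names here are only illustrative):

CREATE TABLE links (
    link_id     INT IDENTITY(1,1) PRIMARY KEY,
    href        NVARCHAR(400) NOT NULL,
    description NVARCHAR(200) NULL
)

CREATE TABLE clicks (
    link_id    INT      NOT NULL REFERENCES links (link_id),
    click_time DATETIME NOT NULL DEFAULT GETDATE(),
    user_type  VARCHAR(20) NULL,   -- the "other data": attributes from the question
    location   VARCHAR(20) NULL
)

-- Reporting is then just a filter on click_time, e.g. top 20 links over the past week:
SELECT TOP 20 link_id, COUNT(*) AS clicks
FROM clicks
WHERE click_time >= DATEADD(day, -7, GETDATE())
GROUP BY link_id
ORDER BY COUNT(*) DESC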

Geert Immerzeel
+4  A: 

Any time you see columns with numbers in their names, such as column_1, column_2, column_3..., your 'horrible database design' flag should go up. (FYI, you are breaking 1NF here; specifically, you are repeating groups across columns.)

Now, it is possible that such an implementation is acceptable (or even necessary) in production, but conceptually it is definitely wrong.

As Geert says, conceptually two tables will suffice. If performance is an issue you could denormalize the data for weekly/monthly stats, but even then I would not model it as above; I would keep something like

CREATE TABLE base_stats ( link_id INT, click_time DATETIME )
CREATE TABLE daily_stats ( link_id INT, period DATETIME, clicks INT )

You can always aggregate with

SELECT link_id, COUNT(*) AS clicks, CAST(click_time AS DATE) AS day
FROM base_stats
GROUP BY link_id, CAST(click_time AS DATE)

which can be run periodically to fill the daily_stats. If you want to keep it up to date you can implement it in triggers (or if you really must, do it on the application side). You can also denormalize the data on different levels if necessary (by creating more aggregate tables, or by introducing another column in the aggregated data table), but that might be premature optimization.
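
If triggers are the route taken, a rough sketch could look like this (my assumptions: SQL Server 2008+ for the DATE type, a unique key on daily_stats (link_id, period), and that heavy concurrency would be handled separately, e.g. with appropriate locking or MERGE):

CREATE TRIGGER trg_base_stats_daily ON base_stats
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON

    -- bump counters for link/day combinations that already have a row
    UPDATE d
    SET d.clicks = d.clicks + i.new_clicks
    FROM daily_stats d
    JOIN (SELECT link_id, CAST(click_time AS DATE) AS period, COUNT(*) AS new_clicks
          FROM inserted
          GROUP BY link_id, CAST(click_time AS DATE)) i
      ON i.link_id = d.link_id AND i.period = d.period

    -- add rows for combinations seen for the first time
    INSERT INTO daily_stats (link_id, period, clicks)
    SELECT i.link_id, CAST(i.click_time AS DATE), COUNT(*)
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM daily_stats d
                      WHERE d.link_id = i.link_id
                        AND d.period = CAST(i.click_time AS DATE))
    GROUP BY i.link_id, CAST(i.click_time AS DATE)
END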

The above design is much cleaner for future ad-hoc analysis (which, with stats, will happen). For other benefits, see Wikipedia on repeating groups.

EDIT: Even though the solution with the two tables base_stats and daily_stats has been accepted, with the following strategy:

  • insert each click in base_stats
  • periodically aggregate the data from base_stats into daily_stats and purge the full detail

it might not be the optimal solution. Based on the discussion and the clarified requirements, it seems that the base_stats table is not necessary. The following approach should also be investigated:

CREATE TABLE period_stats ( link_id INT, period DATETIME, ...)

Updates are easy with

UPDATE period_stats 
SET clicks = clicks + 1 
WHERE period = @dateTime AND link_id = @url AND ...

The cost of updating this table, properly indexed, is comparable to the cost of inserting rows into base_stats, and it is also easy to use for analysis:

SELECT link_id, SUM(clicks)
FROM period_stats
WHERE period BETWEEN @dateTime1 AND @dateTime2
GROUP BY ...
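
For example (index name and key choice are my assumptions), a clustered index leading on period supports both the targeted UPDATE and the date-range aggregation; the remaining attribute columns behind the "..." would typically be appended to the key:

CREATE CLUSTERED INDEX IX_period_stats_period_link
ON period_stats (period, link_id)
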
Unreason
Thanks -- yes, this is what I started with. Flags very much raised, hence this post. I'm unlikely to want to do too much analysis on the data, but the more open it is the better. The key thing is performance. This is a first step towards a dynamic navigational model on our Intranet -- users will be able to browse by "Most popular with your colleagues", "most popular this week in your office", etc., rather than relying on a strict hierarchy -- a big change....
Jhong
... The click listener attaches itself to most links on a page, and with 10k users, there will be a huge stream of data coming in. If this slows the application server to a crawl, I will have trouble convincing people that this is the future. On the other hand, once this is proven on our homepage, I'll be pushing this as the way forward for most every page on the Intranet -- thousands of pages, including a Sharepoint instance, and many more users. I'm not interested in accurate analytics -- purely aggregates at this stage....
Jhong
... I prefer the normalized model, but how expensive is count(*) with GROUP BY and ORDER BY on a table with half a million rows? Nothing substitutes for testing, but I'm keen to get the design in the right ballpark.
Jhong
One table is enough!
iDevlop
@Patrick: Conceptually you are right: one table is enough (logical design) and it is more flexible/cleaner that way. However, your exclamation is not justified: in physical design you are free to denormalize where it is warranted. On 500k rows, and with an application that will do mostly reads, an implementation that uses triggers to maintain aggregates will certainly put less load on the server (while the triggers keep data integrity strong). The question is what ratio of reads vs. writes justifies the added complexity (and it is not really that complex).
Unreason
@Jhong, with an index covering all of your GROUP BY columns (order-dependent), the `count(*)` will not be expensive: it can be calculated from the index alone (in this sense an index *is* like an aggregate table, allowing fast count, max, min, etc.). But, as you yourself pointed out, nothing beats testing, and all of the above can be tested easily.
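
For example, an index along these lines (name assumed) lets the daily COUNT(*) aggregation shown in the answer be answered from the index alone:

CREATE INDEX IX_base_stats_link_time ON base_stats (link_id, click_time)
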
Unreason
Thanks. I've written an alternative DB layer using the "1 click per row" paradigm -- and it certainly makes my head hurt a lot less. There are fewer INSERTs per click now: one for the click table, and possibly one each for the links, depts and offices tables. Given that there will be a lot more writes to the DB than reads, this probably makes sense. The number of rows still worries me, though. Say I get 5 clicks/second (and the peak will likely be higher)... that's 3.6 million rows for a month of working days. I'm worried that this is too heavy for what should be a simple aggregating backend.
Jhong
@Jhong, OK, but I would not keep all of that detail if it is not necessary. Periodically you can aggregate the data into a table similar to daily_stats (not really well named, since the period can be whatever you choose) and purge it from the one-click-per-row table.
Unreason
Yes, that makes sense. I'll pursue this route. Thanks.
Jhong
+2  A: 

That design is really bad. Unreason's proposal is better.
If you want to make it nice and easy, you could just as well have a single table with these fields:

   timeSlice  
   clickCount  
   location
   userType 

with timeSlice holding the date and time rounded down to the hour. Everything else can be deduced from that, and you would have only
24 * 365 * locations# * types#
records per year.

Depending on the configuration and what is feasible, with this table design you could even accumulate values in memory and only update the table once every 10 seconds, or any interval up to 1 hour, depending on the acceptable risk.
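
A possible sketch of that table and the increment, as I picture it (names, sizes and the DECLAREd example values are only illustrative; in practice they would be application parameters):

CREATE TABLE click_stats (
    timeSlice  DATETIME    NOT NULL,   -- date and time rounded down to the hour
    clickCount INT         NOT NULL DEFAULT 0,
    location   VARCHAR(20) NOT NULL,
    userType   VARCHAR(20) NOT NULL
)

DECLARE @slice DATETIME, @location VARCHAR(20), @userType VARCHAR(20)
SET @location = 'X'
SET @userType = 'TypeA'
SET @slice = DATEADD(hour, DATEDIFF(hour, 0, GETDATE()), 0)   -- round down to the hour

-- add 1 per click, or the accumulated in-memory count per batch
UPDATE click_stats SET clickCount = clickCount + 1
WHERE timeSlice = @slice AND location = @location AND userType = @userType

IF @@ROWCOUNT = 0
    INSERT INTO click_stats (timeSlice, clickCount, location, userType)
    VALUES (@slice, 1, @location, @userType)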

iDevlop
Yes; however, I was worried about the half a million clicks that would come streaming into the click table. That said, I guess inserting the click blindly is going to be a lot less expensive than my current method, which requires selecting current counts and incrementing columns. Selecting and counting the huge result set will be a killer, but I can cache that.
Jhong
Clicks streaming: at least (1) you'll easily locate the record, (2) you might, if technically feasible, consider waiting for 10 clicks before incrementing your table, and (3) clicks will stream anyway: your architecture just multiplies them by 3.
iDevlop
There will be 24 * 365 records per year, times the number of links, times the number of users (colleagues, office), times other dimensions. As the links are already estimated at 10,000, you can easily run into millions of records. Still, with proper indexes this could be OK. Testing is the way to go.
Unreason
@Unreason: OK, fair point on the number of records; I updated the table to better reflect the question, which I had not fully read. I still think, however, that one table is the way to go here, whatever the size or speed involved. If I had to add another table, it would only be for archiving records that are more than X days/months old.
iDevlop
@Patrick, well, whether one or two tables are needed depends on further requirements; when I wrote my initial answer it was not clear whether there were any. Yes, of course: the table that would have a row for each click is redundant if it is not necessary (sic). Also, I was aiming to show variations of the design. With a base_stats table (in memory) it is possible, for example, to do very simple inserts and run updates on a persistent table every x minutes, without needing a separate application layer or cache.
Unreason