views:

42

answers:

2

I'm creating a forum app in php and have a question regarding database design:

I can get all the posts for a specific topic.All the posts have an auto_increment identity column as well as a timestamp.

Assuming I want to know who the topic starter was, which is the best solution?

  • Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic. Then I have the first two posts with the same timestamp(unlikely but possible). I can't know who the first one was. This is also normalized but becomes expensive after the table grows.

  • Get all the posts for the topic and order by post_id. This is an auto_increment column. Can I be guaranteed that the database will use an index id by insertion order? Will a post inserted later always have a higher id than previous rows? What if I delete a post? Would my database reuse the post_id later? This is mysql I'm using.

  • The easiest way off course is to simply add a field to the Topics table with the topic_starter_id and be done with it. But it is not normalized. I believe this is also the most efficient method after topic and post tables grow to millions of rows.

What is your opinion?

+2  A: 

Zed's comment is pretty much spot on.

You generally want to achieve normalization, but denormalization can save potentially expensive queries.

In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.

Charles
A: 

I must somewhat disagree with Charles on the fact that the only way to save on performance is to de-normalize to avoid an extra query.

To be more specific, there's an optimization that would work without denormalization (and attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small (let's say <1000 users, for the sake of argument - depends on your scale. Our apps use this approach with 10k+ mappings).

Namely, you have your application layer (code running on web server), retrieve the list of users into a proper cache (e.g. having data expiration facilities). Then, when you need to print first/last user's name, look it up in a cache on server side.

This avoids an extra query for every page view (as you need to only retrieve the full user list ONCE per N page views, when cache expires or when user data is updated which should cause cache expiration).

It adds a wee bit of CPU time and memory usage on web server, but in Yet Another Holy War (e.g. spend more resources on DB side or app server side) I'm firmly on the "don't waste DB resources" camp, seeing how scaling up DB is vastly harder than scaling up a web or app server.

And yes, if that (or other equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (less headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm an agnostic in that particular Holy War, I just go with what gives better marginal benefits (e.g. how much performance loss vs. how much cost/risk from de-normalization)

DVK
Scaling up MySQL is not particularly difficult, actually - it's pretty simple to introduce a replicated slave or two to take up the extra read load, and to direct heavier queries to run on a slave.
Rob