tags:

views: 26

answers: 1

I have three tables: actions, messages, likes. They model inheritance: messages and likes are children of actions (specialization).

Message and Like both have the columns userId and createdAt. Those should of course be moved to the parent table Action and removed from Message and Like. But there's only one case where I need to select both messages and likes from the database; in all other cases I select only one of them, either messages or likes.

Is it OK to duplicate userId and createdAt in both the child and parent tables? It costs disk space but saves one join; otherwise I would have to join messages or likes with actions every time I needed userId and createdAt. What's more, I would need to change my current code...
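For concreteness, here is a minimal sketch of the normalized design I'm describing (roughly PostgreSQL-style SQL; columns such as body and targetId are just placeholders for whatever the child tables actually hold):

    -- Parent table: columns common to every action
    CREATE TABLE actions (
        id        INTEGER PRIMARY KEY,
        userId    INTEGER NOT NULL,
        createdAt TIMESTAMP NOT NULL
    );

    -- Child tables: only subtype-specific columns, keyed by the parent id
    CREATE TABLE messages (
        actionId INTEGER PRIMARY KEY REFERENCES actions (id),
        body     TEXT
    );

    CREATE TABLE likes (
        actionId INTEGER PRIMARY KEY REFERENCES actions (id),
        targetId INTEGER
    );

    -- Fetching messages now needs the join I am worried about
    SELECT m.body, a.userId, a.createdAt
    FROM messages m
    JOIN actions a ON a.id = m.actionId;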

What would you suggest?

+1  A: 

In my opinion this is a case of premature optimization (or premature denormalization, if you prefer). You're guessing that the join overhead will cause significant problems, so you're guessing that duplicating the userId and createdAt columns in the dependent tables will improve performance significantly.

I suggest that you should not duplicate columns until you know there's a real problem. I keep a few observations on performance optimization tacked up on the wall to remind myself of what I should do in similar cases:

  1. It ain’t broke ‘til it’s broke.
  2. You can’t improve what you haven’t measured.
  3. Programs spend surprising amounts of time in the damnedest places.
  4. Make it run. Make it run right. Make it run right fast.
    • optimization is literally the last thing you should be doing.
    • doing things wrong faster is no great benefit.
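
As a concrete way to apply point 2 before touching the schema, ask the database what the join actually costs on realistic data. A minimal sketch, assuming PostgreSQL's EXPLAIN ANALYZE (other engines have their own equivalents) and the placeholder schema from the question:

    -- Measure the join's real cost before deciding to denormalize
    EXPLAIN ANALYZE
    SELECT m.body, a.userId, a.createdAt
    FROM messages m
    JOIN actions a ON a.id = m.actionId
    WHERE a.userId = 42;

    -- If the plan shows the join or the filter is the bottleneck, an index
    -- is the usual first fix, and the schema stays normalized
    CREATE INDEX actions_userId_idx ON actions (userId);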

Also a few comments on denormalization:

  1. You can’t denormalize that which is not normalized.
  2. Most developers wouldn’t know third-normal form if it leapt out from behind their screen, screamed like a banshee, and cracked a baseball bat over their heads.
  3. Denormalization is often suggested as a panacea for database performance issues. The problem is that, too often, those recommending denormalization have never normalized anything.
  4. “Denormalization for performance reasons” is an excuse for sloppy, “do what we’ve always done” thinking, especially when such denormalization is enshrined in the design.

In my experience, I am not able to identify where performance problems will occur before writing code. Problems always seem to occur in places where I would never have thought to look. Thus, I've found that my best choice is to write the simplest, clearest code I can and to design the database as simply as I can, following the normalization rules to the best of my ability, and then to deal with what turns up. There may still be performance issues that need attention (surprisingly, not all that often), but in the end I have simple, clear, easily understood and maintained code running on a simple, well-designed database.
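
One illustration (my sketch, not something you have to do) of "dealing with what turns up" without duplicating columns: if the join ever becomes a measurable nuisance, a view can hide it while the base tables stay normalized. Using the placeholder names from the question:

    -- The view hides the join: code reads one relation, data stays normalized
    CREATE VIEW messagesWithMeta AS
    SELECT m.actionId, m.body, a.userId, a.createdAt
    FROM messages m
    JOIN actions a ON a.id = m.actionId;

    -- Callers no longer spell out the join themselves
    SELECT userId, createdAt, body
    FROM messagesWithMeta
    WHERE userId = 42;

Application code then reads a single relation, and the decision about whether to denormalize stays inside the database, where it can be revisited later.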

Share and enjoy.

Bob Jarvis
Thanks, great post. Basically, I should get rid of the duplication and rewrite the legacy code? :-)
PetrB
@PetrB: Thanks. If there's legacy code involved it's clearly a different situation, and you may not be able (due to time or organizational constraints) to rewrite everything. My comments were directed at a new database and new code.
Bob Jarvis
It's my personal project, so time and other constraints don't really matter, and the related legacy code isn't a big issue either. From your post it's clear that it's better to get the data by joining the parent table in all cases, even when I need just one child.
PetrB