Please give your opinion of situations in which it is a good idea to NOT normalize. I just witnessed a heated discussion between an architect and a DBA, in which one was arguing that the database was TOO normalized.
Performance and query possibilities: If the DB is too normalized, this may result in a lot of joins in your queries, and limit your options for searching on specific attributes. When doing DB design, you should consider the way you plan to use it, by doing access path analysis.
To elaborate, it's a rule of thumb to normalize data that is updated frequently, and denormalize data that is mostly read.
Normalization eliminates redundancy, but if it slows down performance (due to all the required joins) to the point where the hardware costs become unreasonable, it is time to allow redundancy for performance's sake. That's my rule of thumb. The same applies when response times get too long.
Data warehousing often uses a non-normalized approach for performance reasons. Per Wikipedia:
A standard piece of database design guidance is that the designer should create a fully normalized design; selective denormalization can subsequently be performed for performance reasons. However, some modeling disciplines, such as the dimensional modeling approach to data warehouse design, explicitly recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF.
You have to find the sweet spot... Being too normalized, you end up with a lot of "bloaty" abstract structures which contain just 1 or 2 columns of data and you end up joining 5 tables for most queries.
Being under-normalized, you end up having a lot of data in a lot of different places. This can cause the DB to be slower because of cache size and such. Also, now when you need to update something, you have 4 different tables to update instead of 1, and don't even get me started on making sure all the data in different places matches up!
Basically, pick your poison, watch how your DB will be used, and be sane about it. All rules were meant to be broken, and if you have a piece of data in maybe 2 places that are very commonly accessed, I would say that's OK. It's a critical piece where joining (possibly more than 2) tables can just be too expensive. But also don't micro-optimize for space or speed of the database.
When you're optimizing prematurely
Some normalization is there to allow future growth, and you may not need it.
For example, imagine you have a person table. You can have birthday as a column, because each person will only ever have one birthday.
If you're strictly normalizing, you wouldn't have phone_number and cell_number and fax_number as columns in person, but might instead have a phonenumber table where each row has a number, a type, and a person_id relation. This can be better than just sticking new columns in the person table, because
- Many people will not have all of those, leaving a lot of blank spaces, and
- If someone has 3 cell numbers, you'd be adding awkward columns like cell_number_2
Concern #1 is valid, but concern #2 may be an example of "You ain't gonna need it." It's valid to say "we only allow one cell number, and that's that." In that case, you might not bother making a separate table for phone numbers.
It's a tradeoff. By not making the separate table, you're not strictly normalizing, and you might have a lot of NULL spaces. But you also have fewer joins to do, and it's just less work.
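To make the tradeoff concrete, here is a minimal sketch of both layouts using Python's built-in sqlite3 module. The table names person_flat and the sample row are made up for illustration; only person, phonenumber, and the column names come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option A: one column per kind of number (simpler, but some NULLs)
    CREATE TABLE person_flat (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        birthday TEXT,
        phone_number TEXT,
        cell_number TEXT,
        fax_number TEXT
    );

    -- Option B: separate table, any number of phone numbers per person
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        birthday TEXT
    );
    CREATE TABLE phonenumber (
        id INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(id),
        type TEXT NOT NULL,
        number TEXT NOT NULL
    );
""")

# Hypothetical sample data, stored once in each layout.
conn.execute("INSERT INTO person_flat (id, name, cell_number) VALUES (1, 'Ann', '555-0100')")
conn.execute("INSERT INTO person (id, name) VALUES (1, 'Ann')")
conn.execute("INSERT INTO phonenumber (person_id, type, number) VALUES (1, 'cell', '555-0100')")

# Option A: a plain single-table lookup.
flat_cell = conn.execute(
    "SELECT cell_number FROM person_flat WHERE id = 1"
).fetchone()[0]

# Option B: the same question now needs a join and a type filter.
joined_cell = conn.execute("""
    SELECT pn.number
    FROM person p
    JOIN phonenumber pn ON pn.person_id = p.id
    WHERE p.id = 1 AND pn.type = 'cell'
""").fetchone()[0]
```

Both queries return the same number; the difference is how much schema and SQL you carry around to get it.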
Like many good practices, normalization can become an end in itself - a gold badge you give yourself privately, because you Did It Right. And that's fine. But it's good to realize that the rules can be bent sometimes in the interest of keeping things simple.
One last thing: you have to weigh the fact that it sucks to change your db schema once your code is up and running. So it's OK to say "we ain't gonna need it," but try to be pretty sure before you commit.
Several good answers already on storage vs performance, but in addition I'd add the nugget that one further indication for me that de-normalization should be considered is where you're going to require queries using self-joins.
Conceptually of course there's nothing wrong with self-joining tables, but experience shows that it's one of the more difficult concepts for inexperienced programmers to grasp and consequently tends to produce bugs. You'll most likely ease the future maintenance path if you can design the need for these out.
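For anyone who hasn't run into one, this is roughly what a self-join looks like. It is only a sketch with a hypothetical employee table (not from the discussion above), using Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: each employee optionally points at a manager,
    -- who is just another row in the same table.
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        manager_id INTEGER REFERENCES employee(id)
    );
    INSERT INTO employee VALUES (1, 'Dana', NULL);
    INSERT INTO employee VALUES (2, 'Eli', 1);
""")

# The table is joined to itself under two aliases: e (employee), m (manager).
# LEFT JOIN keeps employees that have no manager.
rows = conn.execute("""
    SELECT e.name, m.name
    FROM employee e
    LEFT JOIN employee m ON m.id = e.manager_id
    ORDER BY e.id
""").fetchall()
```

A typical slip here is writing a plain JOIN instead of LEFT JOIN, which silently drops the rows without a manager.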
It's a matter of judgement of course and so is an indication, not a rule.
The rule is normalize til it hurts, then denormalize til it works. (who said that?)
In general, I often denormalize when I have a lot of parent child relationships and I know I would often have to join to five or six large tables to get one piece of data (say the client id for instance) and will not need any of the information from the intermediate tables much of the time. If at all possible, I try to denormalize things that will not change frequently (such as id fields). But anytime you denormalize, you have to write triggers or some other process (but normally triggers if it isn't something that can be handled through a PK/FK relationship and cascading updates) to make sure the data stays in sync. If you fail to do this at the database level, then you will have data integrity problems and your data becomes useless. Do not think you can maintain the denormalization through the application code. This is a recipe for disaster, as databases are often updated from places other than the application.
Denormalizing correctly can slow inserts, updates and deletes, especially if you need to do large batches of data. It may or may not improve select query speed depending on how you need to query the data. If you end up needing to do a lot of self-joins to get the data, it is possible you would have been better off not denormalizing. Never denormalize without testing to see if you have improved performance. Remember slowing inserts/updates/deletes will have an overall effect on the system when many users are using it. By denormalizing to fix one problem, you may be introducing a worse problem in the overall system. Don't just test the one query you are trying to speed up, test the performance of the whole system. You might speed up a query that runs once a month and slow down other queries that run thousands of times a day.
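As a rough illustration of keeping a denormalized copy in sync at the database level, here is a sketch using SQLite through Python's sqlite3 module. The project and task tables and the trigger name are hypothetical, and trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (
        id INTEGER PRIMARY KEY,
        client_id INTEGER NOT NULL
    );
    CREATE TABLE task (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES project(id),
        client_id INTEGER NOT NULL  -- denormalized copy of project.client_id
    );

    -- Whenever a project moves to another client, push the change
    -- down to the copies stored on its tasks.
    CREATE TRIGGER project_client_sync
    AFTER UPDATE OF client_id ON project
    BEGIN
        UPDATE task SET client_id = NEW.client_id WHERE project_id = NEW.id;
    END;

    INSERT INTO project VALUES (10, 1);
    INSERT INTO task VALUES (100, 10, 1);
""")

# Any writer, not just the application, now keeps the copy consistent.
conn.execute("UPDATE project SET client_id = 2 WHERE id = 10")
task_client = conn.execute(
    "SELECT client_id FROM task WHERE id = 100"
).fetchone()[0]
```

This only covers updates to project. A real setup would presumably also need something to fill in or check task.client_id when tasks are inserted or moved between projects.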
Denormalizing is often done for data warehouses which are a special case as they are generally updated automatically on a schedule rather than one record at a time by a user. DBAs who specialize in data warehousing also tend to build them and they know how to avoid the data integrity issues.
Another common denormalizing technique is to create a staging table for data related to a complex report that doesn't need to be run with real time data. This is a sort of poor man's data warehouse and should never be done without a way to update the staging table on a schedule (as infrequently as you can get away with, since this uses server resources that could be better spent elsewhere most of the time). Often these types of table are updated when there are few users on the system and lag a full day behind the real time data. Don't consider doing this unless the query you are staging the data for is truly slow and cannot otherwise be optimized. Many slow queries can be optimized without denormalization, as developers often use the easiest-to-understand rather than the most performant ways to select data.
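A staging table refresh might look something like the sketch below. The orders and sales_report_staging tables and the refresh_sales_report function are invented for illustration; the scheduling itself (cron, a database job, etc.) is left out.

```python
import sqlite3


def refresh_sales_report(conn):
    """Rebuild the staging table from the live table in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM sales_report_staging")
        conn.execute("""
            INSERT INTO sales_report_staging (customer_id, total)
            SELECT customer_id, SUM(amount)
            FROM orders
            GROUP BY customer_id
        """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount REAL NOT NULL
    );
    CREATE TABLE sales_report_staging (
        customer_id INTEGER PRIMARY KEY,
        total REAL NOT NULL
    );
    INSERT INTO orders (customer_id, amount) VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# A scheduled job would call this, e.g. nightly when few users are on.
refresh_sales_report(conn)

# The report reads the precomputed rows instead of aggregating live data.
report = conn.execute(
    "SELECT customer_id, total FROM sales_report_staging ORDER BY customer_id"
).fetchall()
```

The report is only as fresh as the last refresh, which is exactly the lag described above.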
Worked in a place where they normalized too much. They removed the 'state' column from a mailing address table. In place of a 2 byte state column they put an integer foreign key field linking to a state table.
In summary:
They replaced a 2 byte state column with a 4 byte column in the address table. Now every row takes 2 more bytes of storage.
They added a state table with a 4 byte primary key column and a 2 byte state column. Takes up more space to store this table.
The database keeps a B-tree index of the keys in the state table. Takes up more space.
The sql to retrieve addresses is harder to write.
The sql to retrieve addresses is slower than the original.
Sure, this naively removes duplicated unchanging data. The result is that it uses more disk space, is harder to use, and is slower to use.
You definitely can normalize too much.
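To show the "harder to write" part, here is a small sketch of the two designs side by side, using Python's sqlite3 module. The table names, the city column, and the sample values are assumptions for illustration, and this says nothing about actual speed or storage on any particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Original design: the two-character state code lives in the address row
    CREATE TABLE address_simple (
        id INTEGER PRIMARY KEY,
        city TEXT,
        state TEXT
    );
    INSERT INTO address_simple VALUES (1, 'Austin', 'TX');

    -- Over-normalized design: an integer key into a lookup table
    CREATE TABLE state (
        id INTEGER PRIMARY KEY,
        code TEXT NOT NULL
    );
    CREATE TABLE address_fk (
        id INTEGER PRIMARY KEY,
        city TEXT,
        state_id INTEGER REFERENCES state(id)
    );
    INSERT INTO state VALUES (44, 'TX');
    INSERT INTO address_fk VALUES (1, 'Austin', 44);
""")

# Original design: one table, no join.
simple = conn.execute(
    "SELECT city, state FROM address_simple WHERE id = 1"
).fetchone()

# Lookup-table design: the same address needs a join to be readable.
joined = conn.execute("""
    SELECT a.city, s.code
    FROM address_fk a
    JOIN state s ON s.id = a.state_id
    WHERE a.id = 1
""").fetchone()
```

Both queries return the same city and state code; the second just takes more schema and more SQL to get there.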
Don't normalize when you are following a design discipline that results in a less than fully normalized schema. One such design discipline is star schema design, and a close variant of it is snowflake schema.
Both star and snowflake will result in a schema that's far easier to use for a variety of reports, customized extracts, and interface to an OLAP tool such as Cognos Power Play. The downside? Every departure from one of the normal forms (except 1NF) carries with it an anomaly when inserting/updating/deleting data. If you really know the normal forms, you'll know what the related anomalies are. When you write the ETL (extract, transform, and load) procedures to keep your star/snowflake current, you'll have to work around these anomalies.
So when is a star or snowflake schema a better idea than a normalized schema? Typically, for data warehouses, data marts, and reporting databases. In my own practice, I've never built one of these that was not a back end for an OLTP database, and the OLTP database benefits from nearly complete normalization. Don't just denormalize and abandon all discipline. That's like designing at random.