Why do database guys go on about normalisation?
What is it?
How does it help?
Is it even that important?
Does it apply to anything outside of databases?

A: 

It helps prevent duplicate (and worse, conflicting) data.

It can have a negative impact on performance, though.

Brian Knoblauch
Having worked with both normalized and non-normalized data, I prefer a speed drop with normalization over losing data or having difficulty maintaining the application or database.
Schalk Versteeg
Modern database engines employ caching, which often makes normalized databases more efficient than un-normalized ones. If in doubt, measure.
Steven A. Lowe
A denormalized design can be faster for a particular query, but a normalized design offers a compromise by giving reasonable performance for a much wider variety of queries.
Bill Karwin
@Bill, I'd have to disagree somewhat. The only way a fully normalized database helps performance is by preventing the system from having to deal with redundant data. Other than that, it's a worst case situation from a performance standpoint.
Brian Knoblauch
+16  A: 

Most importantly, it serves to remove duplication from the database records. For example, if the name of a person could come up in more than one place (table), you move the name to a separate table and reference it everywhere else. This way, if you need to change the person's name later, you only have to change it in one place.

It is crucial for proper database design, and in theory you should use it as much as possible to preserve data integrity. However, when retrieving information from many tables you lose some performance, and that's why you sometimes see denormalised database tables (also called flattened) used in performance-critical applications.

My advice is to start with a good degree of normalisation and only do de-normalisation when really needed.

P.S. Also check this article to read more on the subject and about the so-called normal forms: http://en.wikipedia.org/wiki/Database_normalization

Ilya Kochetov
You'd also be quite surprised how little denormalisation is really needed in transactional apps. In one monster application I did the data model for, a schema with 560 tables had only 4 items of denormalised data.
ConcernedOfTunbridgeWells
It prevents "update anomalies". It does this by eliminating certain kinds of duplication.
S.Lott
"My advise is to start with good degree of normalisation and only do de-normalisation when really needed". This one advice is a very bad one! I still did not see any proper illustration of this "pseudo-theory". Minus 1.
Philippe Grondier
+1  A: 

Normalization is one of the basic concepts. It means that two things do not influence each other.

In databases specifically, it means that two (or more) tables do not contain the same data, i.e. do not have any redundancy.

At first sight that is really good, because your chances of creating synchronization problems are close to zero, you always know where your data is, etc. But your number of tables will probably grow, and you will have trouble joining the data across tables and getting summary results.

So, in the end you will finish with a database design that is not purely normalized and has some redundancy (it will sit at one of the possible levels of normalization).

Nenad
+2  A: 

Normalization is a procedure used to eliminate redundancy and functional dependencies between columns in a table.

There exist several normal forms, generally indicated by a number. A higher number means fewer redundancies and dependencies. Any SQL table is in 1NF (first normal form), pretty much by definition. Normalizing means changing the schema (often partitioning the tables) in a reversible way, giving a model which is functionally identical, except with fewer redundancies and dependencies.

Redundancy and dependency of data are undesirable because they can lead to inconsistencies when modifying the data.

Rik
+3  A: 

It is intended to reduce redundancy of data.

For a more formal discussion, see the Wikipedia article: http://en.wikipedia.org/wiki/Database_normalization

I'll give a somewhat simplistic example.

Consider an organization's database that usually contains family members:

id   name        address
214  Mr. Chris   123 Main St.
317  Mrs. Chris  123 Main St.

could be normalized as

id   name        familyID
214  Mr. Chris   27
317  Mrs. Chris  27

and a family table

ID  address
27  123 Main St.

Near-complete normalization (BCNF) is usually not used in production, but is an intermediate step. Once you've put the database in BCNF, the next step is usually to denormalize it in a logical way to speed up queries and reduce the complexity of certain common inserts. However, you can't do this well without properly normalizing it first.

The idea is that the redundant information is reduced to a single entry. This is particularly useful for fields like addresses, where Mr. Chris submits his address as Unit-7 123 Main St. and Mrs. Chris lists Suite-7 123 Main Street, which would show up in the original table as two distinct addresses.

Typically, the technique is to find repeated elements, isolate those fields into another table with unique ids, and replace the repeated elements with a foreign key referencing the new table's primary key.
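
As a rough illustration of that technique, here is a minimal Python sketch that pulls the repeated address out into its own table. The function name `normalize_addresses` and the choice to number family ids from 1 are assumptions made for this example only (the tables above use 27 as the family id).

```python
def normalize_addresses(people):
    """Split (id, name, address) rows into a people table and a family table."""
    family_ids = {}    # address -> generated family id (hypothetical numbering from 1)
    people_rows = []
    for person_id, name, address in people:
        # Reuse the id if this address was already seen, otherwise assign the next one.
        family_id = family_ids.setdefault(address, len(family_ids) + 1)
        people_rows.append((person_id, name, family_id))
    family_rows = [(fid, addr) for addr, fid in family_ids.items()]
    return people_rows, family_rows


people = [
    (214, "Mr. Chris", "123 Main St."),
    (317, "Mrs. Chris", "123 Main St."),
]
people_rows, family_rows = normalize_addresses(people)
print(people_rows)
print(family_rows)
```

Both people end up pointing at the same family row, so a change of address only has to be made once. Note that this sketch matches addresses by exact string equality, so "Unit-7" versus "Suite-7" variants would still need cleaning up first.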

chris
BCNF is not "perfect". Higher normal forms exist, up to 6NF, where all your tables are just a key and a data value. It's rarely ever used, though
Rik
Thanks, I edited it to reflect that fact.
chris
+34  A: 

Normalization is basically about designing a database schema such that duplicate and redundant data is avoided. If some piece of data is duplicated in several places in the database, there is the risk that it is updated in one place but not the other, leading to data corruption.

There are a number of normalization levels, from first normal form (1NF) through fifth normal form (5NF). Each normal form describes how to get rid of some specific problem, usually related to redundancy.

Some typical normalization errors:

(1) Having more than one value in a cell. Example:

UserId | Car
---------------------
1      | Toyota
2      | Ford,Cadillac

Here the "Car" column (which is a string) has several values. That violates the first normal form, which says that each cell should have only one value. We can normalize this problem away by having a separate row per car:

UserId | Car
---------------------
1      | Toyota
2      | Ford
2      | Cadillac

The problem with having several values in one cell is that it is tricky to update, tricky to query against, and you cannot apply indexes, constraints and so on.

(2) Having redundant non-key data (i.e. data repeated unnecessarily in several rows). Example:

UserId | UserName | Car
-----------------------
1      | John     | Toyota
2      | Sue      | Ford
2      | Sue      | Cadillac

This design is a problem because the name is repeated in each row, even though the name is always determined by the UserId. This makes it theoretically possible to change the name of Sue in one row and not the other, which is data corruption. The problem is solved by splitting the table in two, and creating a primary key/foreign key relationship:

UserId(FK) | Car               UserId(PK) | UserName
---------------------          -----------------
1          | Toyota            1          | John
2          | Ford              2          | Sue
2          | Cadillac

Now it may seem like we still have redundant data because the UserIds are repeated. However, the PK/FK constraint ensures that the values cannot be updated independently, so integrity is safe.
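
The two-table design above can be tried out with a small sketch using Python's built-in `sqlite3` module. The table and column names (`users`, `cars`, `user_id`, `user_name`, `car`) are assumptions for this example, and SQLite is chosen only because it ships with Python. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE cars (
           user_id INTEGER NOT NULL REFERENCES users(user_id),
           car     TEXT NOT NULL
       )"""
)
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "John"), (2, "Sue")])
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [(1, "Toyota"), (2, "Ford"), (2, "Cadillac")],
)

# The name lives in exactly one row, so a rename is a single-row update.
conn.execute("UPDATE users SET user_name = 'Susan' WHERE user_id = 2")
joined = conn.execute(
    """SELECT u.user_name, c.car
       FROM cars c JOIN users u ON u.user_id = c.user_id
       ORDER BY c.car"""
).fetchall()
print(joined)

# The foreign key rejects a car that points at a user who does not exist.
try:
    conn.execute("INSERT INTO cars VALUES (99, 'Honda')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

After the rename, every car row for user 2 shows the new name through the join, because there is no second copy of the name to fall out of sync.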

Is it important? Yes, it is very important. By having a database with normalization errors, you run the risk of getting invalid or corrupt data into the database. Since data "lives forever", it is very hard to get rid of corrupt data once it has entered the database.

Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse. They make it sound like normalization is a complicated mathematical process. However, normalization is basically just common sense, and you will find that if you design a database schema using common sense it will typically be fully normalized.

There are a number of misconceptions around normalization:

  • Some believe that normalized databases are slower, and that denormalization improves performance. This is only true in very special cases, however. Typically a normalized database is also the fastest.

  • Sometimes normalization is described as a gradual design process where you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problems solved by normal forms above 3NF are pretty rare in the first place, so chances are that your schema is already in 5NF.

Does it apply to anything outside of databases? Not directly, no. The principles of normalization are quite specific to relational databases. However, the general underlying theme (that you shouldn't have duplicate data if the different instances can get out of sync) can be applied broadly. This is basically the DRY principle.

JacquesB
Nicely spelled out. Thanks for taking the time. :)
AJ
The example you gave for first normal form isn't exactly correct. I always remember the first three normal forms by the terms repeating, redundant, non-dependent. Repeating data refers to when novice database developers write table defs that include columns like DogName1, DogName2, DogName3, etc.
Bill
@Bill: Why do you think the example I provide is not correct? Do you know of a definition of 1NF where the example would be OK?
JacquesB
+12  A: 

The rules of normalisation (source: unknown)

  • The key (1NF)
  • The whole key (2NF)
  • and nothing but the key (3NF)

... So help me Codd.

ConcernedOfTunbridgeWells
I think this is a little vague without a proper context, don't you?
Rik
It may be a little vague, but it's a great reminder for those with the context. I know what normalization is and how to go about it, but I can never remember what each of the forms are.
toast
BCNF: http://en.wikipedia.org/wiki/Boyce-Codd_normal_form
Even Mien
+1  A: 

Quoting CJ Date: Theory IS practical.

Departures from normalization will result in certain anomalies in your database.

Departures from First Normal Form will cause access anomalies, meaning that you have to decompose and scan individual values in order to find what you are looking for. For example, if one of the values is the string "Ford, Cadillac" as given by an earlier response, and you are looking for all the occurrences of "Ford", you are going to have to break open the string and look at the substrings. This, to some extent, defeats the purpose of storing the data in a relational database.
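
To make the access anomaly concrete, here is a small Python sketch. The third row ("Fordson") is invented for this example to show how a naive substring search goes wrong, and the variable names are likewise assumptions.

```python
# Multi-valued cells, as in the "Ford, Cadillac" example; row 3 is hypothetical.
rows = [
    (1, "Toyota"),
    (2, "Ford, Cadillac"),
    (3, "Fordson"),
]

# Naive search: a plain substring test also matches "Fordson".
naive_matches = [user_id for user_id, cars in rows if "Ford" in cars]

# Correct search: break each string open and compare the individual values.
exact_matches = [
    user_id
    for user_id, cars in rows
    if "Ford" in [car.strip() for car in cars.split(",")]
]

print(naive_matches)
print(exact_matches)
```

With one car per row, the same lookup is a plain equality comparison on the column, which a database can also index.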

The definition of First Normal Form has changed since 1970, but those differences need not concern you for now. If you design your SQL tables using the relational data model, your tables will automatically be in 1NF.

Departures from Second Normal Form and beyond will cause update anomalies, because the same fact is stored in more than one place. These problems make it impossible to store some facts without storing other facts that may not exist, and therefore have to be invented. Or when the facts change, you may have to locate all the places where a fact is stored and update all those places, lest you end up with a database that contradicts itself. And, when you go to delete a row from the database, you may find that if you do, you are deleting the only place where a fact that is still needed is stored.
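
Two of these anomalies can be reproduced in a few lines with Python's built-in `sqlite3` module, reusing the single-table UserId / UserName / Car layout from an earlier answer. The table name `user_cars` and the new name 'Susan' are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized: the user's name is repeated on every car row.
conn.execute("CREATE TABLE user_cars (user_id INTEGER, user_name TEXT, car TEXT)")
conn.executemany(
    "INSERT INTO user_cars VALUES (?, ?, ?)",
    [(1, "John", "Toyota"), (2, "Sue", "Ford"), (2, "Sue", "Cadillac")],
)

# Update anomaly: renaming the user in only one of her rows
# leaves the database contradicting itself.
conn.execute(
    "UPDATE user_cars SET user_name = 'Susan' WHERE user_id = 2 AND car = 'Ford'"
)
names_for_user_2 = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT user_name FROM user_cars WHERE user_id = 2"
    )
)
print(names_for_user_2)

# Deletion anomaly: removing John's only car also removes
# the only record that user 1 is called John.
conn.execute("DELETE FROM user_cars WHERE user_id = 1 AND car = 'Toyota'")
rows_for_user_1 = conn.execute(
    "SELECT COUNT(*) FROM user_cars WHERE user_id = 1"
).fetchone()[0]
print(rows_for_user_1)
```

Neither statement is rejected by the database, which is the point: the schema itself permits the contradiction and the accidental loss.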

These are logical problems, not performance problems or space problems. Sometimes you can get around these update anomalies by careful programming. Sometimes (often) it's better to prevent the problems in the first place by adhering to normal forms.

Notwithstanding the value in what's already been said, it should be mentioned that normalization is a bottom-up approach, not a top-down approach. If you follow certain methodologies in your analysis of the data, and in your initial design, you can be guaranteed that the design will conform to 3NF at the very least. In many cases, the design will be fully normalized.

Where you may really want to apply the concepts taught under normalization is when you are given legacy data, out of a legacy database or out of files made up of records, and the data was designed in complete ignorance of normal forms and the consequences of departing from them. In these cases you may need to discover the departures from normalization, and correct the design.

Warning: normalization is often taught with religious overtones, as if every departure from full normalization is a sin, an offense against Codd. (little pun there). Don't buy that. When you really, really learn database design, you'll not only know how to follow the rules, but also know when it's safe to break them.

Walter Mitty