views:

682

answers:

9
+10  Q: 

Overnormalization

When would a database design be described as overnormalized? Is this characterization an absolute one? Or is it dependent on the way it is used in the application? Thanks.

+3  A: 

When the performance cost exceeds the benefit towards the application's intended purpose.

MarkB
I always liked the phrase "Normalize 'til it hurts, denormalize 'til it works." :)
vfilby
Exactly - the perfect balance.
MarkB
A very nice statement, vfilby. It summarizes my comment below in one clear and simple sentence. :)
Eigir
+15  A: 

In the general sense, I think a design is overnormalized when you are doing so many JOINs to retrieve data that it causes notable performance penalties and deadlocks on your database, even after you've tuned the heck out of your indexes. Obviously, for huge applications and sites like MySpace or eBay, denormalization is a scaling requirement.

As a developer for several small businesses, I can tell you that in my experience it has always been easier to go from normalized to denormalized than the other way around; going the other way (to remove duplication of data once the business requirements have changed a year or so later) is much more difficult.

When I read general statements such as "you should put the address in your customers table instead of a separate address table so you can avoid the join", I shudder, because you just know that a year from now somebody's going to ask you to do something with addresses that you totally didn't foresee, like maintaining an audit trail or storing multiple addresses per customer. If your database allows you to create an indexed view, you can sidestep that issue until your dataset is so large that it can't be served by a single server, or by a set of servers in a one-write, many-read environment. For most of us, I don't think that scenario happens very often.
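
Something like this, for instance (names are made up, and the indexed-view syntax shown is SQL Server's):

    -- Addresses live in their own table, so multiple addresses per customer
    -- or a later audit trail are additions, not rewrites.
    CREATE TABLE dbo.Customer (
        CustomerID INT IDENTITY PRIMARY KEY,
        Name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.CustomerAddress (
        CustomerAddressID INT IDENTITY PRIMARY KEY,
        CustomerID  INT NOT NULL REFERENCES dbo.Customer (CustomerID),
        AddressType VARCHAR(20)  NOT NULL,   -- e.g. 'Billing', 'Shipping'
        Street      VARCHAR(100) NOT NULL,
        City        VARCHAR(50)  NOT NULL,
        PostalCode  VARCHAR(20)  NOT NULL
    );
    GO

    -- An indexed view hides the join from read-heavy code paths
    -- (assumes at most one billing address per customer).
    CREATE VIEW dbo.CustomerBillingAddress
    WITH SCHEMABINDING AS
    SELECT c.CustomerID, c.Name, a.Street, a.City, a.PostalCode
    FROM dbo.Customer c
    JOIN dbo.CustomerAddress a ON a.CustomerID = c.CustomerID
    WHERE a.AddressType = 'Billing';
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_CustomerBillingAddress
        ON dbo.CustomerBillingAddress (CustomerID);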

When in doubt, I aim for third normal form, with some exceptions (for example, having a field contain a comma-separated list of strings because I know I'll never ever look at the data from the other angle). When I need to consolidate, I'll look at my views or indexes first. Hope this helps.

Nicholas Piasecki
You should aim for at least BCNF (basically a version of 3NF that eliminates an edge case that the official 3NF does not), and very often you'll find that the data is in fact in 5NF at that point anyway.
Jonathan Leffler
+11  A: 

It's always a question of the application domain. It's usually a question of correctness, but occasionally a question of performance.

There's one situation I can think of that is a prima facie case of overnormalization: say you have an Order and an OrderItem, and the OrderItem references a ProductID and leaves pricing to Product.Price. That introduces temporal coupling: unless prices absolutely never change, a price update affects orders that have already shipped, so the schema is incorrectly normalized. You can certainly argue that this is simply a modeling error (as in the comments), but I see under-normalization as a modeling error in most cases, too.
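
As a rough sketch of the fix (hypothetical names), the order item snapshots the price at the time of sale:

    -- Product.Price holds only the *current* price.
    CREATE TABLE Product (
        ProductID INT PRIMARY KEY,
        Name      VARCHAR(100)  NOT NULL,
        Price     DECIMAL(10,2) NOT NULL
    );

    -- OrderItem copies the price at order time, so later price changes
    -- cannot silently rewrite orders that have already shipped.
    CREATE TABLE OrderItem (
        OrderID   INT NOT NULL,
        ProductID INT NOT NULL REFERENCES Product (ProductID),
        Quantity  INT NOT NULL,
        UnitPrice DECIMAL(10,2) NOT NULL,  -- snapshot of Product.Price at sale time
        PRIMARY KEY (OrderID, ProductID)
    );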

The other category is performance related. In principle, I think there are generally better solutions to performance than denormalizing data, such as materialized views, but if your application suffers from the performance consequences of many joins, it may be worth assessing whether denormalizing can help you. I think these cases are often over-emphasized, because people sometimes reach for denormalization before they properly profile their application.
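
For instance, a materialized view (PostgreSQL syntax shown; names are illustrative) keeps the base tables normalized while readers get the pre-joined result:

    -- The joins are paid once per refresh, not once per query.
    CREATE MATERIALIZED VIEW order_summary AS
    SELECT o.order_id,
           c.customer_name,
           SUM(oi.quantity * oi.unit_price) AS order_total
    FROM   orders o
    JOIN   customers   c  ON c.customer_id = o.customer_id
    JOIN   order_items oi ON oi.order_id   = o.order_id
    GROUP  BY o.order_id, c.customer_name;

    -- Refresh on whatever schedule the reports can tolerate.
    REFRESH MATERIALIZED VIEW order_summary;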

People also often forget about alternatives, like keeping a canonical form of the database and using warehousing or other strategies for frequently-read, but infrequently changed data.

JasonTrue
Temporal coupling is a good point and is something that is easy to overlook until 30 days after your implementation goes live. Not that I've been there.
Nicholas Piasecki
I like your emphasis on alternatives. Note that your first case isn't related to normalization at all. It's a failure of the domain designer to distinguish between a product price and a sale price.
RoadWarrior
@RoadWarrior - yes, or more precisely, between 'current product price' and 'sale price'.
Jonathan Leffler
I would think that the first example is not "over-normalization", as the product still logically may have a current price, but under-modeling, since the order item is (as you point out) temporally bound and should thus snapshot the price at the time of the sale.
Steven A. Lowe
All these are fair points, though probably a matter of definition. For me, over-normalization includes cases where correctness is compromised (as a result of poor modeling). Non-normalized schemas compromise correctness unless precautions are taken.
JasonTrue
Although the entire answer is very good, the example of overnormalization is not. It is an example of blatant design mistake. A better example of overnormalization could be something like storing so-called "value objects" in separate tables (which sometimes may NOT be an overnormalization :-).
Yarik
+1  A: 

..or hitting limits on the number of joins your RDBMS will do.

friism
That's a defective or toy DBMS - time to replace it with a real one.
Jonathan Leffler
+6  A: 

Normalization is absolute. A database follows Normal Forms or it does not. There are a half-dozen normal forms. Mostly, they have names like First through Fifth. Plus there's a Boyce-Codd Normal Form.

Normalization exists for precisely one purpose -- to prevent "update anomalies".

Normalization isn't subjective. It isn't a judgement. Each table and relationship among tables either does or does not follow a normal form.

Consequently, you can't be "over-normalized" or "under-normalized".

Having said that, normalization has a performance cost. Some people elect to denormalize in various ways to improve performance. The most common sensible denormalization is to break 3NF and include derived data.

A common mistake is to break 2NF and have duplicate copies of a functional dependency between a key and non-key value. This requires extra updates or -- worse -- triggers to keep the copies in parallel.
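
To make the contrast concrete (illustrative names; ALTER TABLE ... ADD COLUMN as in PostgreSQL or MySQL):

    -- Sensible 3NF break: a derived value, maintained in one place,
    -- derivable from SUM(order_items.quantity * order_items.unit_price).
    ALTER TABLE orders ADD COLUMN order_total DECIMAL(12,2);

    -- 2NF mistake: in order_items (key: order_id + product_id), product_name
    -- depends only on product_id. Renaming a product now means updating every
    -- order_items row that mentions it, or keeping a trigger to do so.
    ALTER TABLE order_items ADD COLUMN product_name VARCHAR(100);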

Denormalization of a transactional database should be decided case by case.

A data warehouse, also, rarely follows any of the transactional normalization rules because it's (essentially) never updated.

"Over-normalization" could mean that a database is too slow because of a large number of joins. This may also mean that the database has outgrown the hardware. Or that the applications haven't been designed to scale.

The most common issue here is that folks try to use a transactional database for reporting while transactions are going on. The locking for transactions interferes with reporting.

"Under-normalization," however, means that there are NF violations and needless processing is being done to handle the replicated data and correct update anomalies.

S.Lott
A: 

My take on this:

Always normalize as much as you are able to. I usually go crazy on normalization and try to design something that could handle every thinkable future extension. What I end up with is a database design that is extremely flexible... and impossible to implement.

Then the real job starts: De-normalization. Here you solve what you know would be problematic to implement and/or would slow the queries down because of too many joins.

This way you know what you sacrifice to make the design usable.

Edit: Documentation! I forgot to mention that documenting the de-normalization is very important. When you take over a project, it is extremely helpful to know the reasons behind the choices.

Eigir
The 'every thinkable future extension' is overkill; at most you need to handle probable extensions (not ones that are just possible). This is part of the agile technique - not worrying too much about the future. With a DBMS, some concern about the future is good, but not too much.
Jonathan Leffler
I understand what you mean, but I believe that the design of the database is the most fundamental part of a project. Mistakes made at that level are the most difficult ones to correct later, since a redesign of the database has a high chance of breaking large parts of the code.
Eigir
A: 

If performance is affected by too many joins, creating de-normalized tables for reporting purposes can speed things up. By copying the data into new tables, it may be possible to run reports with no joins at all.
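
A minimal sketch (PostgreSQL-style CREATE TABLE AS; names are illustrative):

    -- Flatten the joins once, into a table the reports can scan directly.
    CREATE TABLE report_daily_sales AS
    SELECT o.order_date,
           p.category,
           SUM(oi.quantity * oi.unit_price) AS revenue
    FROM   orders o
    JOIN   order_items oi ON oi.order_id  = o.order_id
    JOIN   products    p  ON p.product_id = oi.product_id
    GROUP  BY o.order_date, p.category;

    -- Rebuild on a schedule (e.g. nightly) rather than on every request.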

harriyott
A: 

In my experience, I've never seen postal addresses fully normalized, as it's usually acceptable to store the address as a string. Ideally, there would be tables for countries, counties/states, cities, districts and streets. I've not come across anyone who needs to report at street level, so it hasn't been necessary. The addresses have only been used for postal contact, so they are treated as a single entity.

harriyott
+1  A: 

A lot of people are talking about performance. I think a key issue is flexibility. In general, the more normalized your database, the more flexible it is.

We currently use an "over-normalized" database because, in our operating environment, client requirements change on a monthly basis. By "over-normalizing" we can adapt our software accordingly, without changing the database structure.

ng.mangine
I completely agree. I've used databases with millions of records, and performance was never an issue. The structure of the data needs to be flexible enough to allow for multiple different uses and changing requirements without having to change the data structure. Normalization is the answer to this.
Travis Heseman