Database Normalization

+7 Q:

Database Normalization

I'm new to database design and I have been reading quite a bit about normalization. Suppose I had three tables: Accommodation, Train Stations and Airports. Would I have address columns in each table, or an address table that is referenced by the other tables? Is there such a thing as over-normalization?

Thanks

I think in this situation it is OK to have address columns in each table. You'll hardly ever have an address that is used more than twice; most addresses will be used just once per entity.

But what could go in an extra table are the names of streets, cities, countries...

And most importantly, every train station, accommodation and airport will probably have just one address, so it's an n:1 relation.

nWorx 2010-07-19 15:38:14

*Many* business-type entities have multiple addresses for a single physical location.

DaveE 2010-07-19 16:02:42

+3 A:

For addresses, I would almost always create a separate address table. Not only for normalization but also for consistency in fields stored.

As for such a thing as over-normalization, absolutely there is! It's hard to give you guidance on what is and isn't over-normalization, as I think it mostly comes from experience. However, follow the books on each level of normalization, and once it starts to get difficult to see where things are, you've probably gone too far.

Look at all the sample/example databases you can as well. They will give you a good indication on when you should be splitting out data and when you shouldn't.

Also, be well aware of the type and amount of data you're storing, along with the speed of access, etc. A lot of modern web software is going fully de-normalized for performance and scalability reasons. It's worth looking into those for the reasons why and when you should and shouldn't de-normalize.

Robin Day 2010-07-19 15:42:15

I disagree, or understand the question differently. You cannot formally over-normalize a model after normalizing it to the ultimate normal form.

TheBlastOne 2010-07-19 16:28:13

@TheBlastOne: By "ultimate normal form" do you mean 6NF? In SQL, 6NF is not generally very practical. So you might say 6NF was "over normalized" if by implementing it that way you were prevented from enforcing some key or dependency. In special cases the same can be said of BCNF and 5NF (anything above BCNF is not dependency-preserving). In any case, as has already been said, putting addresses in a different table has no obvious connection with normalization per se.

dportas 2010-07-19 18:21:57

@TheBlastOne: Formally doing something and reality are very different things. I would consider normalizing when there is no need for it to be over-normalization.

Robin Day 2010-07-20 06:48:18

@dportas, @Robin Day: Yeah, good points. That's why I remained vague ;) but I think 6NF is the formal ultimate NF, being impractical in reality most of the time. That's a matter of definition. I'd say normalization is the formal process, which is "complete" once you reach the (whichever) ultimate normal form, and modeling it in a practical way is complete when your model suits your needs, which is a very debatable matter. But you cannot normalize any further once you've formally reached the ultimate normal form -- so over-normalization is not possible in my world of definition.

TheBlastOne 2010-07-20 10:29:04

+2 A:

Personally I'd go for another table.

I think it makes the design cleaner, makes reporting on addresses much simpler and will make any changes you need to make to the address schema easier.

If you need to have it denormalized later on you can always create two views that contain the Train station and airport information along with any address information you need.
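
A minimal sketch of that idea, using Python's built-in sqlite3 module and an in-memory database. The table, column and view names (address, train_station, airport, address_id, train_station_flat, airport_flat) are made up for illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        street     TEXT NOT NULL,
        city       TEXT NOT NULL
    );
    CREATE TABLE train_station (
        station_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    );
    CREATE TABLE airport (
        airport_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    );

    -- Denormalized, read-only "flat" views over the normalized tables.
    CREATE VIEW train_station_flat AS
        SELECT s.station_id, s.name, a.street, a.city
        FROM train_station AS s JOIN address AS a USING (address_id);
    CREATE VIEW airport_flat AS
        SELECT p.airport_id, p.name, a.street, a.city
        FROM airport AS p JOIN address AS a USING (address_id);

    -- Placeholder data: one address shared by a station and an airport.
    INSERT INTO address VALUES (1, '1 Example Road', 'Exampleville');
    INSERT INTO train_station VALUES (1, 'Central Station', 1);
    INSERT INTO airport VALUES (1, 'Example Airport', 1);
""")

station_rows = conn.execute(
    "SELECT name, street, city FROM train_station_flat"
).fetchall()
airport_rows = conn.execute(
    "SELECT name, street, city FROM airport_flat"
).fetchall()
print(station_rows)
print(airport_rows)
```

Callers that want the flat shape query the views, while the address data itself is still stored once.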

Abe Miessler 2010-07-19 15:43:42

+3 A:

Would I have address columns in each table or an address table that is referenced by the other tables?

Can airports, train stations and accommodation each have a different address format?

A single ADDRESS table minimizes the work necessary dealing with addresses - suite, RR, postal/zip code, state/province...

Is there such a thing as over-normalization?

There are different levels of normalization. I've only encountered what I'd consider poor design rather than normalization.

OMG Ponies 2010-07-19 15:49:39

I can certainly imagine that an airport could have a train station, and thus they would share an address.

Gabe 2010-07-19 16:36:34

@Gabe: I can imagine a lot of things, but consider a more plausible situation: Numerous businesses can occupy a building - do they all have the same address?

OMG Ponies 2010-07-19 17:12:19

The businesses would all have the same address in terms of getting directions or plotting on a map, but they would all likely have different mailing addresses (because they would have a different floor or suite). Of course, if the address of the building changed ("Sears Tower" renamed to "Willis Tower" or "E. 155 St." changed to "JFK St."), it would make sense to change them all at once.

Gabe 2010-07-19 19:18:10

@Gabe: No address references a building, only a street, and streets are seldom renamed.

OMG Ponies 2010-07-19 19:26:34

This application appears to be the type where you're displaying locations on a map, in which case everything in the same building would indeed have the same address.

Gabe 2010-07-19 19:55:51

@Gabe: You'd use Lat/Long (better yet, UTM) for geospatial co-ordinates, not an address like what the OP is asking normalization about.

OMG Ponies 2010-07-19 20:00:06

OK, pretend I wrote "This application appears to be the type where you're giving driving directions". In other words, your address needs to be something you can give to a cab driver. That means a mailing address wouldn't work (driving to a PO box isn't what you want) and lat/lon is useless.

Gabe 2010-07-19 20:14:21

@Gabe: I'm not pretending anything - my answer regards what the OP asked.

OMG Ponies 2010-07-19 20:30:20

I don't understand what you're getting at. It just looks like the OP is creating an application where multiple types of entities at the same location would share an address. Therefore it seems that the proper normalization is to have a separate table of addresses.

Gabe 2010-07-19 20:54:46

@Gabe: That is exactly what @OMGPonies is suggesting! "A single ADDRESS table minimizes the work necessary". I think you're both arguing the same point from different directions!

Robin Day 2010-07-20 06:51:03

If you are using Oracle 9i, you could store address objects in your tables. That would remove the (justified) concerns about address formats.

Brian Hooper 2010-07-19 15:52:23

+1 A:

This isn't really what I understand by normalisation. You don't seem to be talking about removing redundancy, just about how to partition the storage or data model. I'm assuming that the addresses for Accommodation, Train Stations and Airports will all be disjoint?

As far as I know, it would only be normalisation if you started thinking along these lines: postcode is functionally dependent upon street address, so it should be factored out into its own table.

Whether this is desirable or undesirable depends upon context. It is perhaps desirable if you administer the records and can ensure correctness, and less desirable if users can update their own records.
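
A rough sketch of what factoring out the postcode could look like, using Python's built-in sqlite3 module. It assumes, purely for illustration, that (street, city) determines the postcode, which real postal systems do not always guarantee. The table and column names (street_postcode, accommodation, house_number) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    -- Assumed dependency for illustration: (street, city) -> postcode.
    CREATE TABLE street_postcode (
        street   TEXT NOT NULL,
        city     TEXT NOT NULL,
        postcode TEXT NOT NULL,
        PRIMARY KEY (street, city)
    );
    CREATE TABLE accommodation (
        accommodation_id INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        house_number TEXT NOT NULL,
        street       TEXT NOT NULL,
        city         TEXT NOT NULL,
        FOREIGN KEY (street, city) REFERENCES street_postcode (street, city)
    );
    INSERT INTO street_postcode VALUES ('Example Road', 'Exampleville', 'EX1 2AB');
    INSERT INTO accommodation VALUES (1, 'Hotel One', '1', 'Example Road', 'Exampleville');
    INSERT INTO accommodation VALUES (2, 'Hotel Two', '7', 'Example Road', 'Exampleville');
""")

# The postcode is stored once per street and looked up with a join.
postcodes = conn.execute("""
    SELECT a.name, sp.postcode
    FROM accommodation AS a
    JOIN street_postcode AS sp ON sp.street = a.street AND sp.city = a.city
    ORDER BY a.accommodation_id
""").fetchall()
print(postcodes)

# A record on a street that nobody has registered yet is rejected.
try:
    conn.execute(
        "INSERT INTO accommodation VALUES (3, 'Hotel Three', '2', 'Unknown Lane', 'Exampleville')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The rejected insert is where the trade-off shows up: an administrator can keep the lookup table correct, whereas users entering their own addresses would be blocked whenever their street is missing from it.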

Martin Smith 2010-07-19 15:57:34

+1 A:

If you have a project/piece of functionality that is very performance sensitive, it may be smart to denormalize the database in some cases. However, this can lead to maintenance issues for various reasons. You may instead want to duplicate the data with cache tables, but there are drawbacks to this as well. It's really decided on a case-by-case basis, but in normal practice, database normalization is a good thing. 99% of the non-normalized databases I've seen are not that way by design, but rather because of a misunderstanding/mistake by the developer.
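
One possible reading of "cache tables", sketched with Python's built-in sqlite3 module: a denormalized snapshot table that has to be refreshed by hand. The names (airport_cache, refresh_cache, cached_city) are assumptions for illustration. The sketch also shows one of the drawbacks, namely that the copy goes stale until it is rebuilt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (address_id INTEGER PRIMARY KEY, city TEXT NOT NULL);
    CREATE TABLE airport (
        airport_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    );
    INSERT INTO address VALUES (1, 'Oldtown');
    INSERT INTO airport VALUES (1, 'Example Airport', 1);

    -- Cache table: a denormalized snapshot of the join, duplicated for fast reads.
    CREATE TABLE airport_cache AS
        SELECT p.airport_id, p.name, a.city
        FROM airport AS p JOIN address AS a USING (address_id);
""")


def refresh_cache():
    """Rebuild the snapshot from the normalized source tables."""
    with conn:  # commit both statements together, or roll back on error
        conn.execute("DELETE FROM airport_cache")
        conn.execute("""
            INSERT INTO airport_cache
            SELECT p.airport_id, p.name, a.city
            FROM airport AS p JOIN address AS a USING (address_id)
        """)


def cached_city():
    """Read the city for airport 1 from the cache table only."""
    row = conn.execute(
        "SELECT city FROM airport_cache WHERE airport_id = 1"
    ).fetchone()
    return row[0]


# Change the source data; the cache is not updated automatically.
with conn:
    conn.execute("UPDATE address SET city = 'Newtown' WHERE address_id = 1")

stale = cached_city()   # value read before the refresh
refresh_cache()
fresh = cached_city()   # value read after the refresh
print(stale, fresh)
```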

smp7d 2010-07-19 16:03:27

I agree with S.Lott, and would like to add:

A good answer depends on what you know already. The basic "math" of relational database theory, however, defines precise, distinct levels of normalization. You cannot normalize any further once you've reached the ultimate normal form.
Depending on what you want to model with your three entities, and how you identify them, you can come up with very different conceptual data models, all of which can be represented in a mix of normal forms -- or not normalized at all (like 1 table for all data with descriptors and NULL holes all over the place...). Suppose you normalize your three entities to the ultimate normal form. I can now introduce a new requirement, or use case, or extension, which gives an up-to-now descriptive attribute a somehow ordered, or referencing, or structured nature if you look at its content. Then, the model should represent this behavior, and what used to be an attribute will perhaps be better off as a separate entity referenced by other entities.
Over-normalization? Only in the sense that you can normalize a given model so far that it gets inefficient to store, or process, on a given DB platform. Depending on what can be handled efficiently there, you might want to de-normalize certain aspects, trading off redundancy for speed (data warehouse dbs do this all the time), and insight, or vice versa.

All (working) db designs I've seen so far either have a rather normalized conceptual data model, with quite some denormalization done at the logical and/or physical data model level (speaking in Sybase PowerDesigner terms) to make the model "manageable" -- either that, or they were not working, i.e. failed because the maintenance problems became kingsize real quick.

TheBlastOne 2010-07-19 16:22:17

When you say "address", I presume you mean a complete address, like street, city, state/province, maybe country, and zip/postal code. That's 4 or 5 fields, maybe more if you allow for "address line 1" and "address line 2", care-of's, etc. That should definitely be in a separate table, with an "addressid" to link to the Station, etc. tables. Otherwise, you are creating 3 separate copies of the same set of field definitions. That's bad news because it creates extra effort to keep them consistent. Like, what if initially you are only dealing with U.S. addresses (I'm an American so I'll assume U.S.), but later you find you also need to allow for Canadians. You'll need to expand the size of the postal code field and add a country code. If there's a common table, then you only have to do this once. If there isn't, then you have to do this three times. And it's likely that the "three times" is not just changing the database schema, but changing every place in your programs that processes an address.

One of the benefits of normalization is to minimize the impact of changes.
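
To make the "only once" point concrete, here is a small sketch using Python's built-in sqlite3 module. The schema (address, station, airport, accommodation, address_id, country_code) and the helper `columns` are assumptions for illustration. SQLite does not enforce a length on TEXT columns, so widening the postal code field is not shown; only the new country code is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One shared definition of what an address looks like.
    CREATE TABLE address (
        address_id  INTEGER PRIMARY KEY,
        line1       TEXT NOT NULL,
        line2       TEXT,
        city        TEXT NOT NULL,
        state       TEXT NOT NULL,
        postal_code TEXT NOT NULL
    );
    CREATE TABLE station (
        station_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    );
    CREATE TABLE airport (
        airport_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    );
    CREATE TABLE accommodation (
        accommodation_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    );
""")


def columns(table):
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


# Later requirement: support Canadian addresses too. Only the address table
# changes; rows that already exist get the default 'US'.
conn.execute(
    "ALTER TABLE address ADD COLUMN country_code TEXT NOT NULL DEFAULT 'US'"
)

print(columns("address"))
print(columns("station"))
```

With address columns copied into each of the three tables, the same ALTER TABLE would have to be repeated for every copy, along with the code that reads them.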

Jay 2010-07-19 16:42:21

+1 A:

Would I have address columns in each table or an address table that is referenced by the other tables?

As others have alluded to, this is not really a question of normalization because you're not attempting to reduce redundancy or organize dependencies. Either way is perfectly acceptable. Moving the addresses to a separate table might make sense if you are going to have centralized validation or business logic specific to addresses.

Is there such a thing as over-normalization?

Yes. As has been mentioned, in large systems (lots of data, lots of transactions, or both) you can normalize to the point where performance becomes an issue. This is why lots of systems use a denormalized database for reporting and querying.

In addition to performance though, there is also the issue of how easy the data is to query. In systems where there will be a lot of end-user querying of the data (can be dangerous!), a denormalized structure is easier for most non-technical or non-database people to understand.

Like most things we deal with, it's a trade-off between understanding, performance, and future maintainability and there is rarely a clear-cut answer to where you draw the line in any given system.

With experience, you will learn where the line is best drawn for the systems you write.

With that said, my preference is to err on the side of more vs less normalization.

RWGodfrey 2010-07-19 16:44:44

There are times when you want to denormalize to make queries more efficient. But this should be done very cautiously, only after you have good reason to believe that the fully normalized model creates serious inefficiency problems. In my humble experience, most programmers are far too quick to denormalize, usually with a quick "oh, breaking that out into a separate table is too much trouble".

Jay 2010-07-19 16:45:52

+2 A:

Database Normalization is all about constructing relations (tables) that maintain certain functional dependencies among the facts (columns) within the relation (table) and among the various relations (tables) making up the schema (database). Bit of a mouthful, but that is what it is all about.

A Simple Guide to Five Normal Forms in Relational Database Theory is the classic reference for normal forms. This paper defines in simple terms what the essence of each normal form is and its significance with respect to database table design. This is a very good "touch-stone" reference.

To answer your specific question properly requires additional information. Some critical questions you have to ask are:

Is an Address a simple fact (e.g. blob of text) or a composite fact (e.g. composed of multiple attributes: Address line, City Name, Postal Code etc.)?
What are the other "facts" relating to "Accommodation", "Airport" and "Train Station"?
What sets of "facts" uniquely and minimally identify an "Airport", an "Accommodation" and a "Train Station" (these facts are typically called a key or candidate key)?
What functional dependencies exist among Address facts and the facts composing each relation's key?

All this to say, the answer to your question is not as straightforward as one might hope for!

Is there such a thing as "over normalization"? Maybe. This depends on whether the functional dependencies you have identified and used to build your tables are of significance to your application domain.

For example, suppose it was determined that an address was composed of multiple attributes; one of which is postal code. Technically a postal code is a composite item too (at least Canadian Postal Codes are). Further normalizing your database to recognize these facts would probably be an over-normalization. This is because the components of a postal code are irrelevant to your application and therefore factoring them into the database design would be an over-normalization.
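
As a small illustration of the postal code point: a Canadian postal code has the form "A1A 1A1", where the first three characters are the forward sortation area and the last three are the local delivery unit. If an application ever did need the parts, they can be derived on demand from a single stored column, as in this sketch. The function name `split_canadian_postal_code` is made up for illustration, and the check is a length check only, not full validation.

```python
def split_canadian_postal_code(code: str) -> tuple[str, str]:
    """Split a code like 'K1A 0B1' into (forward sortation area, local delivery unit)."""
    compact = code.replace(" ", "").upper()
    if len(compact) != 6:
        raise ValueError(f"expected 6 characters, got {len(compact)}: {code!r}")
    return compact[:3], compact[3:]


fsa, ldu = split_canadian_postal_code("K1A 0B1")
print(fsa, ldu)
```

Storing the two halves as separate columns or tables would only pay off if the application had rules that depend on them individually.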

NealB 2010-07-19 18:02:53
