views:

924

answers:

8

I find this comes up a lot, and I'm not sure the best way to approach it.

The question I have is how to make the decision between using foreign keys to lookup tables, or just taking the lookup table value and inserting it into the table requesting the data when you have tables that are mostly based on existing data.

Points to keep in mind:

  • With the second method you would need to do mass updates to all records referencing the data if it is changed in the lookup table.
  • This is focused more towards tables that have a lot of the column's referencing many lookup tables.Therefore lots of foreign keys to means a lot of joins every time you query the table.
  • This data would be coming from drop down lists which would be pulled from the lookup tables. In order to match up data when reloading, the values need to be in the existing list (related to the first point).

Is there a best practice here, or any key points to consider?


Great answers guys, I'm going to keep this open a bit longer for discussion, and see what other feedback we can get.

+1  A: 

Normalization is pretty universally regarded as part of best practices in databases, and normalization says yeah, you push the data out and refer to it by key.

chaos
Only until it doesn't make sense any more. The norm is also to denormalize when performance suffers because of too normalized data. Sometimes a logical structure is just too slow.
Lasse V. Karlsen
I don't disagree. The OP is asking in a pretty general way, though, so I'd tend to say "normalize" until I see that normalization is too much for the situation.
chaos
+6  A: 

Rule of thumb: Normalize the database, benchmark and denormalize only if required to achieve good performance. Never denormalize without benchmarking.

If you had to denormalize, always preserve integrity of data using constraints and triggers.

Mehrdad Afshari
A: 

Since no one else has addressed your second point: When queries become long and difficult to read and write due to all those joins, a view will usually resolve that.

derobert
I always caution on the use of views - they save readability but they may have a significant performance cost.
Jeremiah Peschka
Views are fairly cheap on most databases as they normally function by query rewriting. Check your DB's 'explain' (or similar) output, of course.
derobert
+1  A: 

You can even make it a rule to always program against the views, having the view get the lookups.

This makes it possible to optimize the view and make your code resistant to changes in the tables.

In oracle, you could even convert the view into a materialized view if you ever need to.

Osama ALASSIRY
Materialized views are not Oracle-specific. SQL Server supports indexed views which are basically the same thing, though the syntax is different.
Mehrdad Afshari
+1  A: 

In cases of simple atomic values, I tend to disagree with the common wisdom on this one, mainly on the complexity front. Consider a table containing hats. You can do the "denormalized" way:

CREATE TABLE Hat (
  hat_id INT NOT NULL PRIMARY KEY,
  brand VARCHAR(255) NOT NULL,
  size INT NOT NULL,
  color VARCHAR(30) NOT NULL /* color is a string, like "Red", "Blue" */
)

Or you can normalize it more by making a "color" table:

CREATE TABLE Color (
  color_id INT NOT NULL PRIMARY KEY,
  color_name VARCHAR(30) NOT NULL
)

CREATE TABLE Hat (
  hat_id INT NOT NULL PRIMARY KEY,
  brand VARCHAR(255) NOT NULL,
  size INT NOT NULL,
  color_id INT NOT NULL REFERENCES Color(color_id)
)

The end result of the latter is that you've added some complexity - instead of:

SELECT * FROM Hat

You now have to say:

SELECT * FROM Hat H INNER JOIN Color C ON H.color_id = C.color_id

Is that extra join a huge deal? No - in fact, that's the foundation of the relational design model - normalizing allows you to prevent possible inconsistencies in the data. But every situation like this adds a little bit of complexity, and unless there's a good reason, it's worth asking why you're doing it. I consider possible "good reasons" to include:

  • Are there other attributes that "hang off of" this attribute? Are you capturing, say, both "color name" and "hex value", such that hex value is always dependent on color name? If so, then you definitely want a separate color table, to prevent situations where one row has ("Red", "#FF0000") and another has ("Red", "#FF3333"). Multiple correlated attributes are the #1 signal that an entity should be normalized.
  • Will the set of possible values change frequently? Using a normalized lookup table will make future changes to the elements of the set easier, because you're just updating a single row. If it's infrequent, though, don't balk at statements that have to update lots of rows in the main table instead; databases are quite good at that. Do some speed tests if you're not sure.
  • Will the set of possible values be directly administered by the users? I.e. is there a screen where they can add / remove / reorder the elements in the list? If so, a separate table is a must, obviously.
  • Will the list of distinct values power some UI element? E.g. is "color" a droplist in the UI? Then you'll be better off having it in its own table, rather than doing a SELECT DISTINCT on the table every time you need to show the droplist.

If none of those apply, I'd be hard pressed to find another (good) reason to normalize. If you just want to make sure that the value is one of a certain (small) set of legal values, you're better off using a CONSTRAINT that says the value must be in a specific list; keeps things simple, and you can always "upgrade" to a separate table later if the need arises.

Ian Varley
ummm, in this case, you could have hat.color have a foreign key on color.color. Normalization != creating fake keys. Also, you've basically said that there is no need to move color to its own table if it has no meaning; it is just a arbitrary string.... well, yeah. It isn't a relation then.
derobert
@derobert - Agreed, it's not a relation if it's just arbitrary strings. I took that to be the thrust of his question: when do you make it a relation, and when do you not?
Ian Varley
+8  A: 

You can use a lookup table with a VARCHAR primary key, and your main data table uses a FOREIGN KEY on its column, with cascading updates.

CREATE TABLE ColorLookup (
  color VARCHAR(20) PRIMARY KEY
);

CREATE TABLE ItemsWithColors (
  ...other columns...,
  color VARCHAR(20),
  FOREIGN KEY (color) REFERENCES ColorLookup(color)
    ON UPDATE CASCADE ON DELETE SET NULL
);

This solution has the following advantages:

  • You can query the color names in the main data table without requiring a join to the lookup table.
  • Nevertheless, color names are constrained to the set of colors in the lookup table.
  • You can get a list of unique colors names (even if none are currently in use in the main data) by querying the lookup table.
  • If you change a color in the lookup table, the change automatically cascades to all referencing rows in the main data table.


It's surprising to me that so many other people on this thread seem to have mistaken ideas of what "normalization" is. Using a surrogate keys (the ubiquitous "id") has nothing to do with normalization!

Bill Karwin
+1  A: 

rauhr.myopenid.com wrote:

The way we decided to solve this problem is with 4th normal form. ...

That is not 4th normal form. That is a common mistake called One True Lookup: http://www.dbazine.com/ofinterest/oi-articles/celko22

4th normal form is : http://en.wikipedia.org/wiki/Fourth_normal_form

Ben Dempsey
+1  A: 

One thing no one has considered is that you would not join to the lookup table if the data in it can change over time and the records joined to are historical. The example is a parts table and an order table. The vendors may drop parts or change part numbers, but the orders table should alawys have exactly what was ordered at the time it was ordered. Therefore, it should lookup the data to do the record insert but should never join to the lookup table to get information about an existing order. Instead the part number and description and price, etc. should be stored in the orders table. This is espceially critical so that price changes do not propagate through historical data and make your financial records inaccurate. In this case, you would also want to avoid using any kind of cascading update as well.

HLGEM