views:

840

answers:

21

What are the constituents of a good relational database design?

I'd like to ask this question in a slightly unusual way; let's see if anyone likes the idea. Also read on to see how it's not a duplicate of the question about DB design best practices (or any other question for that matter). The idea is that as a result of voting we will have constituents ranked by their perceived importance.

I'd appreciate if you could:

  1. Specify 1 (one) constituent of a good relational database design per answer, give multiple answers if you must.
  2. State the idea in the fewest words possible at the top of your answer.
  3. Then give possibly two illustrations:
    • benefit of having the constituent
    • cost of not paying enough attention to the constituent

Please avoid (again, if possible) posting duplicate constituents; if something you'd like to highlight is already submitted, please choose something else, unless, of course, you can give a better illustration. I'm sure there are at least two dozen highly important things one needs to incorporate into the design.

A: 

It depends on the person, but for me 3rd normal form does.

Frank V
+2  A: 

It depends completely on the purpose of the DB but I find the first step is NORMALIZATION

victor hugo
A: 

the database always runs fast

KM
thanks for the down votes! I hope your database runs slow but you are happy with the design!
KM
+9  A: 

First normal form

In day-to-day practice, and ignoring the issue of whether NULL is a value, this winds up meaning "use a primary key".

The benefit of having a primary key is that rows are unique and have a value by which they can be looked up as an O(log n) or O(1) operation.

Without a primary key, rows are not unique and many common, desirable operations using the database are therefore far slower and/or more difficult.
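As a minimal sketch of the point above (assuming SQLite through Python's built-in sqlite3 module; the `car` table and its columns are hypothetical, not something from the answer), a declared primary key lets the database itself refuse a duplicate row and gives a direct lookup by key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the license plate is declared as the primary key.
conn.execute("CREATE TABLE car (plate TEXT PRIMARY KEY, model TEXT NOT NULL)")
conn.execute("INSERT INTO car VALUES ('ABC-123', 'Civic')")

try:
    # Same key again: the primary key constraint rejects the row.
    conn.execute("INSERT INTO car VALUES ('ABC-123', 'Accord')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Lookup by key can use the index that backs the primary key.
model = conn.execute(
    "SELECT model FROM car WHERE plate = ?", ("ABC-123",)
).fetchone()[0]
```

Without the PRIMARY KEY clause both inserts would succeed, and the lookup could return either model.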

chaos
A: 

easy to maintain/modify without refactoring

KM
thanks for the down votes! I hope your database is a real pain to maintain and modify, but you're happy with the design!
KM
+3  A: 

Using Primary Keys

Primary keys ensure there is a unique id for querying, updating or deleting. They also allow joins to work faster (as this column is indexed).

Nathan Koop
+3  A: 

The flexibility of NOT using 3rd normal form when necessary. The need usually arises when mindless adherence to normalization has left you with 50 indexes and multi-level joins.

ennuikiller
+9  A: 

Second normal form

2NF, the short version: do not store any information in a table if that information is dependent on only part of a composite key. This is one level of saying that information should "go where it belongs", i.e. if you have a Car table, a Driver table, and a CarHasDriver table, you don't store information that's purely about the Car in the CarHasDriver table.

The benefit is having a coherent point of reference for where to find data, avoiding overly complex or ambiguous queries, avoiding data duplication, and generally not having your database be a big mess.

The cost of not using 2NF is the absence of those benefits. :)
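A rough sketch of the Car / Driver / CarHasDriver example, assuming SQLite via Python's sqlite3 module. The column names (`color`, `since` and so on) are made up for illustration. The car's color depends only on the car, so it lives in Car and not in the table keyed by (car, driver):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Car    (car_id INTEGER PRIMARY KEY, color TEXT NOT NULL);
    CREATE TABLE Driver (driver_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Composite key: only facts about the (car, driver) pair belong here.
    CREATE TABLE CarHasDriver (
        car_id    INTEGER NOT NULL REFERENCES Car(car_id),
        driver_id INTEGER NOT NULL REFERENCES Driver(driver_id),
        since     TEXT NOT NULL,
        PRIMARY KEY (car_id, driver_id)
    );
    INSERT INTO Car VALUES (1, 'red');
    INSERT INTO Driver VALUES (10, 'Ann'), (11, 'Bob');
    INSERT INTO CarHasDriver VALUES (1, 10, '2009-01-01'), (1, 11, '2009-02-01');
""")

# Repainting the car is a single-row update, however many drivers it has.
conn.execute("UPDATE Car SET color = 'blue' WHERE car_id = 1")

colors = conn.execute("""
    SELECT d.name, c.color
    FROM CarHasDriver AS chd
    JOIN Car    AS c ON c.car_id = chd.car_id
    JOIN Driver AS d ON d.driver_id = chd.driver_id
    ORDER BY d.name
""").fetchall()
```

Had `color` been stored in CarHasDriver, the repaint would have to touch one row per driver, and missing one would leave the car with two colors.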

chaos
Kinda surprising these normal forms float to the top. This one in particular is meant for old-school tables where an orderline might have a key of <custnr><ordnr><linenr>, or 027-20080056-001. This rule would instruct you not to store order information in the orderline table. But when was the last time you saw a composite primary key?
Andomar
It's not completely nonsensical to use composite primary keys for many-to-many mapping tables. Seen it, oh, within the last couple of years.
chaos
As to them floating to the top, I have to suspect that enough of us have had to fix databases 'designed' by well-meaning but confused individuals that we have a keen sense of the distinction between 'elementary' and 'unnecessary'.
chaos
Sure, many-to-many mapping tables usually have a composite primary key. But I never saw one that was a 2NF violation.
Andomar
Even if your orderline table does have a surrogate key, it still has a candidate (unique) key of say (order_no, line_no), and it would still violate 2NF to hold an attribute that depends solely on the order_no in this table. I agree you are unlikely to see a violation; that's because the normalisation rules are "formalised common sense".
Tony Andrews
All of our internal databases have at least one instance of a composite PK (usually in joining tables, but not always). It's called data integrity baby!
Keith Williams
+1  A: 

My favourite rule is that you have to weigh technical complexity against business benefit.

An example is a webshop that starts to normalize its database. Say they move the billing and invoice address from the customer table to a separate address table. This adds big technical complexity:

  • The customer editor will now have to support a random number of addresses
  • Sql queries with address require extra joins
  • A new relation between orders and addresses is introduced
  • Updates to the customer might deadlock with updates to the address table; you have to take care of the order of update statements
  • ... and so on ...

The business benefit of this technical complexity is (usually) low; the separate address table enables no new functionality.

The price you pay is that your competition will spend this time on things that customers appreciate, and leave you behind in the market.

Andomar
+6  A: 

Using Appropriate Datatype & datasizes

Using a varchar(50) to represent a code that is always 5 characters long is wrong; using a varchar(20) to represent a city name or country name is wrong too. Please size the varchar columns correctly. Also use the appropriately sized int or float: if a value is always going to be less than 32,768, use an int (not a bigint, as I've seen done).
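A small sketch of stating the size in the schema, assuming SQLite through Python's sqlite3 module and a hypothetical `warehouse` table. One caveat specific to this choice of engine: SQLite does not enforce a declared CHAR/VARCHAR length, so the fixed length is also written as a CHECK constraint here. Most other engines enforce the declared size directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table. SQLite ignores the "(5)" in CHAR(5), so the fixed
# length is repeated as a CHECK constraint to get it enforced.
conn.execute("""
    CREATE TABLE warehouse (
        code CHAR(5) NOT NULL CHECK (length(code) = 5)
    )
""")

def accepted(code):
    """Return True if the database accepts the code."""
    try:
        conn.execute("INSERT INTO warehouse VALUES (?)", (code,))
        return True
    except sqlite3.IntegrityError:
        return False

ok_code = accepted("AB123")
long_code = accepted("AB123-EXTRA")
```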

Nathan Koop
and do not store dates in varchar fields!
HLGEM
@HLGEM amen...
Nathan Koop
+9  A: 

Third normal form

3NF, oversimplified, is not storing information in a table that is unrelated to its unique or potentially unique keys. For instance, if you have a Car table, a Driver table, and a CarHasDriver table, you do not store CarAndDriverLastSeenStateAbbreviation and CarAndDriverLastSeenStateFullName in CarHasDriver, because now you're using CarHasDriver to store non-key information about States.

3NF is somewhat "2NF but more so", and the benefits are an extension of 2NF's facilitation of having a single, coherent point of reference for information on a particular topic.

The drawbacks of not using it are similar, though minor violations of 3NF are generally considered much more forgivable (such as storing State as a 2-letter state code instead of a key into a State table).
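Sketching the State example, again assuming SQLite via Python's sqlite3 module (the `last_seen_state` column and the sample rows are invented for illustration). The full state name is a fact about the state, so it is stored once in a State table and reached through a join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Facts about states live in one place.
    CREATE TABLE State (abbreviation TEXT PRIMARY KEY, full_name TEXT NOT NULL);
    CREATE TABLE CarHasDriver (
        car_id          INTEGER NOT NULL,
        driver_id       INTEGER NOT NULL,
        last_seen_state TEXT REFERENCES State(abbreviation),
        PRIMARY KEY (car_id, driver_id)
    );
    INSERT INTO State VALUES ('NY', 'New York'), ('OH', 'Ohio');
    INSERT INTO CarHasDriver VALUES (1, 10, 'NY'), (2, 11, 'NY'), (3, 12, 'OH');
""")

rows = conn.execute("""
    SELECT chd.car_id, s.full_name
    FROM CarHasDriver AS chd
    JOIN State AS s ON s.abbreviation = chd.last_seen_state
    ORDER BY chd.car_id
""").fetchall()
```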

chaos
3NF is a historical curiosity. Normally, it is equivalent to BCNF, which is the preferred term - but there is an odd-ball case allowed by 3NF that is not allowed by BCNF.
Jonathan Leffler
So write up BCNF as an answer, dude. :)
chaos
Ah, never mind, someone got the jump on you.
chaos
+4  A: 

Application Determined Indexing

The database's non-primary keys should be constructed according to the queries which will be made against it by the application, potentially with compound keys designed to optimize the most speed-critical queries.

The benefit is that the application runs much faster without unnecessary cost.

If you ignore this, either you have an under-indexed table that slows down the application, possibly drastically, for no good reason, or you have an over-indexed table that wastes disk space and slows down INSERTs and UPDATEs for indexes that no one is actually using.
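One way to check whether an index matches a query is to ask the database for its plan. The sketch below assumes SQLite through Python's sqlite3 module, with a hypothetical `orders` table and index name. The exact wording of the plan text varies between SQLite versions, so the code only looks at whether the index is mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, total REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

# The query the (hypothetical) application runs most often.
QUERY = "SELECT id, total FROM orders WHERE customer_id = ?"

def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the detail text

plan_before = plan(QUERY)  # no suitable index yet: the table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(QUERY)   # the new index is now used for the search
```

The same check run against an index nobody queries would show it never appearing in any plan, which is the over-indexing case described above.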

chaos
+3  A: 

Object Representability

The entities in the database should be chosen so that they correspond naturally to object representations that developers working with the application would choose to work with. For example, if you have a Car table, its columns should make equal amounts of sense as the properties of a Car object, and if your developers' model of the system has Users who may or may not be Drivers, it's probably a good idea to work parallel to that at the database level.
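As an illustration only (a Python dataclass and SQLite's sqlite3 module stand in for whatever object model and database the application actually uses; the `make`, `model` and `year` columns are invented), a table whose columns read naturally as object properties maps to the object with no translation layer:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Car:
    id: int
    make: str
    model: str
    year: int

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (id INTEGER PRIMARY KEY, make TEXT, model TEXT, year INTEGER)"
)
conn.execute("INSERT INTO Car VALUES (1, 'Honda', 'Civic', 2008)")

# Columns line up one-to-one with the object's properties.
row = conn.execute("SELECT id, make, model, year FROM Car WHERE id = 1").fetchone()
car = Car(*row)
```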

The benefit of this is that it aids greatly in using object-relational mapping to interact with the database, otherwise makes database-application integration flow more easily, and in practice tends to push you in the direction of using 2NF and 3NF, which is all to the good.

The risk of not doing this is making extra, unnecessary work for developers and getting a reputation as the coder-unfriendly database guy. :)

chaos
+1  A: 

Let me creatively misinterpret this as, "What are the constituents of good RDBMS (i.e. the software that manages a database in a relational manner) design?" In no particular order:

Independence of logical and physical layout. I should never have to duplicate an attribute or join two relations in my logical layout in order to speed queries; I should just tell the physical layer to do that, and have it ensure that data never gets out of sync transparently to me. The only time I should hear about it is when a change to the logical schema is incompatible with the existing physical schema, prompting me to make the appropriate changes. Thus as an admin, I can turn a row-oriented storage format into a column-oriented storage format or vice versa, or even maintain both, and my clients see nothing but faster queries. (The same goes for optimization hints, by the way: those should be entirely separate from the logical query.) As a poster previous to me mentioned, the physical design and changes should be automated to some degree, based on query history or whatever.

Abolishment of "primary keys." A candidate key is a candidate key, and should not have any logical priority over another. If you want one set of attributes indexed, but not another, that goes in the physical specification, not the logical.

A relational query language that is, more or less, the relational algebra. Well beyond the various bits of small cruft one expects to accumulate here and there (and I'm not even counting the physical storage hints in this), SQL has some very basic things wrong with it that make it non-relational, and very hard to use if you're trying to use a relational model. For years now, my standard technique for formulating complex queries has been to spend twenty minutes specifying it in some fairly pure form of the relational algebra, and then forty minutes trying to translate that into SQL.

Real types, specified as easily as real types in my programming language of choice. If I have complex numbers, they should have an attribute in a relation of a single complex number, not two separate attributes as real and imaginary. And functions should be defined for types, not defined for certain types and also-does-god-knows-what for different types that come along. Yes, I'm talking about abolishing NULL, though not at all about losing the capability for it. I don't mind creating a type of "all signed integers expressible in 32 bits, plus 'unknown' and 'out of range.'" But when I call the "average" function on a set of those values, I don't want it handing me a number and snickering behind its back saying, "Ha ha! I wonder if what I just made up is similar to the answer he'd come up with?" I want it to say, "I know average for integers; for what you've got, you'd better tell me what you think it means."

(An example: "sum(3, 2, unknown)" is not "5", nor is it "unknown". It's, "at least 5.")
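For reference, this is what SQL's built-in aggregates do with the same three values when NULL stands in for "unknown". The sketch assumes SQLite via Python's sqlite3 module and a hypothetical `stock` table; skipping NULLs in SUM and AVG is standard SQL behaviour, not an SQLite quirk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (qty INTEGER)")  # NULL stands in for 'unknown'
conn.executemany("INSERT INTO stock VALUES (?)", [(3,), (2,), (None,)])

# The aggregates silently skip the NULL row instead of reporting uncertainty.
total, average, counted, all_rows = conn.execute(
    "SELECT SUM(qty), AVG(qty), COUNT(qty), COUNT(*) FROM stock"
).fetchone()
```

Here `total` comes back as 5 and `average` as 2.5 (the mean of the two known values), with nothing in the result to say that one of three rows was ignored.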

Get rid of or fix these "ORM" layers. Going and spending a lot of CPU cycles to have a relational DBMS instead of a hierarchical or network DBMS, only in order to spend yet more CPU cycles to make that relational DBMS look like a hierarchical or network DBMS later on is not only a waste of CPU, but also leads to errors in one conversion being multiplied in another. Give me direct relational queries in my language. (I'm not a big MS fan, but I give them props for LINQ, which is a step in this direction.)

This question hit a sore point, so there's my rant.

Curt Sampson
sum(3,2,unknown) is only 'at least 5' if the values are constrained to be non-negative; if the actual value marked 'unknown' is -100, then the answer is not 'at least 5'
Jonathan Leffler
True enough, for integers. I should have mentioned that I was thinking of a cardinal field; a typical example would be for systems keeping track of inventory.
Curt Sampson
+8  A: 

Using Referential Integrity (foreign keys)

One of the mainstays of a good DB design is to make sure relationships between data tables and/or data tables and lookup tables are "secured" by means of referential integrity (using foreign key relationships). This helps ensure data integrity and avoids "stale" or wrong data.
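A minimal sketch of a foreign key doing its job, assuming SQLite through Python's sqlite3 module and hypothetical `customer` and `orders` tables. Two SQLite-specific details are worth noting: foreign key enforcement is normally off until `PRAGMA foreign_keys = ON` is issued on the connection, and the referencing column gets no index unless you create one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is usually off by default in SQLite
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id)
    );
    -- The referencing column is not indexed automatically here.
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);
    INSERT INTO customer VALUES (1, 'Ann');
    INSERT INTO orders VALUES (100, 1);
""")

def try_sql(sql):
    """Run a statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

orphan_accepted = try_sql("INSERT INTO orders VALUES (101, 999)")  # no such customer
delete_accepted = try_sql("DELETE FROM customer WHERE id = 1")     # still referenced
```

Both statements are refused, so an order can never point at a customer that does not exist.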

marc_s
And make sure to index them when you create them!
HLGEM
Of course! That's a given (almost) :-)
marc_s
marc, you'd be surprised how many people think they get indexed automatically and then wonder why their system is slow
HLGEM
+3  A: 

Boyce-Codd Normal Form (BCNF)

This is the more complete, improved version of Third Normal Form. OK, to be honest I always have to look it up to remember what it means exactly, but if you look at the example in the Wikipedia page, it clearly makes sense that there should be a Rate Type table.

The advantage is that it removes a potential update anomaly that 3NF by itself would allow.

Tony Andrews
Definitions for BCNF, 4NF, and 5NF (aka PJNF) have pleasing symmetry: BCNF: Every non-trivial FD (functional dependency) is a consequence of a candidate key; 4NF: Every non-trivial MVD (multi-valued dependency) is a consequence of a candidate key; 5NF: Every non-trivial JD (join dependency) is a consequence of a candidate key. And if you accept Date, Darwen and Lorentzos's 6NF: There are no non-trivial JDs.
Jonathan Leffler
+1  A: 

Using Natural Primary Keys

When possible, each entity should have an identifier that is used by the user community to identify that entity. If you can trust that your data entry people will get that identifier right, declare it as a primary key, rather than inventing an ID column that you then need to keep private.

Declaring a natural primary key will prevent duplicates where it really counts. It also prevents missing data where it really counts. Using a surrogate ID column as a primary key misses the biggest reason to have primary keys: to prevent bad data from undermining entity integrity. If your data lacks entity integrity, ensuring referential integrity will only win half of the battle for you.
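To make the duplicate problem concrete, here is a sketch assuming SQLite via Python's sqlite3 module. The employee tables and the badge number are hypothetical stand-ins for "an identifier used by the user community". The same person is entered twice with a slightly different spelling of the name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate key only: nothing stops the same employee being entered twice.
    CREATE TABLE employee_surrogate (
        id       INTEGER PRIMARY KEY,
        badge_no TEXT,
        name     TEXT
    );
    -- Natural key: the identifier the users already rely on.
    CREATE TABLE employee_natural (
        badge_no TEXT NOT NULL PRIMARY KEY,
        name     TEXT
    );
""")

def insert_twice(table):
    """Insert the same badge number twice; return how many rows end up stored."""
    for name in ("Ann Lee", "Ann  Lee"):
        try:
            conn.execute(
                f"INSERT INTO {table} (badge_no, name) VALUES ('B-17', ?)", (name,)
            )
        except sqlite3.IntegrityError:
            pass  # duplicate rejected by the primary key
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

surrogate_rows = insert_twice("employee_surrogate")
natural_rows = insert_twice("employee_natural")
```

The surrogate-only table ends up with two rows for badge B-17, while the table keyed on the badge number keeps one.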

For tables that represent many-to-many relationships among entities, this may mean composite primary keys. That's a small price to pay in efficiency compared to the loss of integrity that you might get by not declaring such a key.

Note that there is nothing fundamental in relational theory that makes one candidate key better than another as a choice for primary key. But if you know how your users use the data, you'll be able to choose a primary key among the candidates.

The index you get when you declare a primary key is only a secondary benefit. The prime benefit is data integrity.

Walter Mitty
I used to feel the same way and still agree with some of what you say, but as I've worked with increasingly complicated databases, the surrogate key approach is winning me over, as it makes foreign key relationships easier to manage. However, in cases where the primary key is simple or a single column, then naturally it works just fine to keep the natural key. Also, having a surrogate PK doesn't mean you can't also have a unique index identifying the natural key.
Bernard Dy
@Bernard Dy: if you have a surrogate PK, it is crucial that you also have a unique constraint (usually equivalent to a unique index) identifying the natural key. Otherwise, you end up with two entries for the same natural key with different surrogate keys, leading to endless problems.
Jonathan Leffler
@Jonathan: You are absolutely correct, sir. I have used words that misrepresent the importance of having a meaningful unique key for the data whether it be a natural PK or something in addition to a surrogate key. No "edit comment" ability, but I can thank you for emphasizing this critical point.
Bernard Dy
@Bernard Dy: precisely. There are significant benefits to using a key that is simple (single basic datatype), anonymous (meaningless), and immutable (won't be changed). A "natural key" will often violate at least one of these, and sometimes selected natural keys turn out (eventually) to be unique only _most_ of the time.
spencer7593
Just be careful that a natural value really is a candidate key. Many that look as if they are, are not. For example, social security numbers are frequently used as candidate keys, but that turns out to be a mistake, because they change, and also because people make errors writing them down, entering them, etc. It's quite possible that, for a period of time, two people have the same SSN in your database until you figure out which one is wrong (or whether you have a duplicate entry).
Curt Sampson
+5  A: 

Never store more than one piece or type of information in a field. Comma delimited lists should be related tables. Email should not store general notes as well as emails. (It doesn't really work well to send an email to "Talk to his secretary.")
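A sketch of the "related table instead of a comma-delimited list" idea, assuming SQLite via Python's sqlite3 module; the `person` and `person_email` tables and the addresses are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- One row per address instead of a comma-delimited list in person.
    CREATE TABLE person_email (
        person_id INTEGER NOT NULL REFERENCES person(id),
        email     TEXT NOT NULL,
        PRIMARY KEY (person_id, email)
    );
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO person_email VALUES
        (1, 'ann@example.com'), (1, 'ann@work.example'), (2, 'bob@example.com');
""")

# Finding the owner of an address is an exact match, not a substring search.
owner = conn.execute("""
    SELECT p.name
    FROM person AS p
    JOIN person_email AS e ON e.person_id = p.id
    WHERE e.email = ?
""", ("ann@work.example",)).fetchone()[0]

emails_per_person = conn.execute("""
    SELECT p.name, COUNT(*)
    FROM person AS p
    JOIN person_email AS e ON e.person_id = p.id
    GROUP BY p.id, p.name
    ORDER BY p.name
""").fetchall()
```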

HLGEM
Looks like a "bug" you personally had to deal with there! "Why isn't our email batch sending program working anymore?"
Chris Marisic
+2  A: 

Simple, Artificial, Opaque Primary Keys

Give all tables a single-column (simple, as opposed to compound) primary key that is artificial (constructed by the system rather than being a natural part of the data) and opaque (encodes no semantic content other than its function as a key), also known as a surrogate key. The easiest way to do this is usually to use an automatically generated serial primary key, though hash and GUID keys are often nearly as easy to use and have the minor security benefit of not being serial.

Benefits include retrieval by primary key being simpler and faster than with a compound key, the ability to use keys in content like Web forms without exposing potentially sensitive information about the records they index, and uniformity of interface in knowing that any given table you work with will have this feature.

If you do not use single-column primary keys, this makes your database more cumbersome to use. If you do not use artificial, opaque primary keys (or at least a unique artificial opaque key somewhere in the table), I certainly hope your database isn't touched by the Web.
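A short sketch of the two key styles mentioned above, assuming SQLite via Python's sqlite3 module and Python's uuid module for the GUID. The `account` and `invoice` tables are hypothetical:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Serial surrogate key generated by the database.
    CREATE TABLE account (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        email TEXT NOT NULL
    );
    -- Opaque, non-serial key generated by the application.
    CREATE TABLE invoice (
        id         TEXT PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES account(id)
    );
""")

cur = conn.execute("INSERT INTO account (email) VALUES ('ann@example.com')")
account_id = cur.lastrowid        # value assigned by the database

invoice_id = str(uuid.uuid4())    # random GUID, says nothing about the row
conn.execute("INSERT INTO invoice VALUES (?, ?)", (invoice_id, account_id))

found = conn.execute(
    "SELECT account_id FROM invoice WHERE id = ?", (invoice_id,)
).fetchone()[0]
```

The GUID can appear in a URL or a Web form without revealing how many invoices exist, which a serial number would.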

chaos
+4  A: 

For an OLTP system, the primary key for all Entity tables should be defined as:

  • simple (single column, basic datatype)
  • anonymous (meaningless, carries no semantic meaning)
  • immutable (value will never be changed once assigned)
  • unique (no duplicate values allowed)
  • not null (no nulls allowed)
  • enforced by declarative constraint(s)

Some authors refer to a key with these qualifications as a "surrogate" or an "artificial" primary key.

I've not been burned by following these guidelines for a primary key for all "Entity" tables managed by an application.

I have witnessed application users and developers burned by not following these guidelines. Just as one example, I was called in to assist in a rewrite of a "pension" system which used social security number as a primary key for a pension. Payees receiving two or more pension payments had "dummy" social security numbers assigned. The users added handwritten notes of the 'real' social security number on printed reports, and manual corrections were made when reporting to the IRS.

The database had to be modified, and a whole bunch of application code had to be changed, because the database and application had been designed around a "natural key" which was almost unique.


My personal preference is to give the column name "ID" to the primary key column of each table. I also prefer for the primary key values to be unique within the schema.

For a non-Entity table (that is, a relationship table that resolves a many-to-many relationship between entity tables, and that has no attributes of its own and no possibility of child tables), I will consider using the combination of the two foreign keys as a primary key. Such a "relationship" table is not mapped to an object in the application. Often, however, I find that the relationship itself may have some attributes of its own, or there is a possibility that it will be a parent to another table. So, it doesn't hurt to go ahead and add a single column "ID" to the table, as well as the unique constraint on the combination of two foreign keys.


Of course, most Entity tables will also have a "natural key", which will consist of one or more columns or expressions. The individual columns and/or expressions that compose the natural key should be NOT NULL and the combination should be UNIQUE, and those constraints should be enforced by declarative constraints in the database.
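A sketch of a surrogate "ID" combined with a declaratively enforced natural key, assuming SQLite via Python's sqlite3 module. The `product` table and its `sku` column are hypothetical and were picked only to keep the example neutral:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        id   INTEGER PRIMARY KEY,     -- simple, anonymous, immutable
        sku  TEXT NOT NULL UNIQUE,    -- natural key, enforced declaratively
        name TEXT NOT NULL
    );
    INSERT INTO product (sku, name) VALUES ('SKU-001', 'Widget');
""")

def rejected(sql, params=()):
    """Return True if a declarative constraint refuses the statement."""
    try:
        conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True

duplicate_natural_key = rejected(
    "INSERT INTO product (sku, name) VALUES (?, ?)", ("SKU-001", "Widget again")
)
missing_natural_key = rejected(
    "INSERT INTO product (sku, name) VALUES (?, ?)", (None, "Gadget")
)

# Correcting a mistyped natural key leaves the surrogate id untouched.
conn.execute("UPDATE product SET sku = 'SKU-002' WHERE id = 1")
product_after_fix = conn.execute("SELECT id, sku FROM product").fetchall()
```

Rows in other tables that reference `product.id` would be unaffected by the correction to `sku`.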

spencer7593
The case of SSN to identify a pension is simply a case of a wrong natural key. A good data analysis would have prevented this design decision. An SSN isn't even an attribute of a pension. At best, it's a foreign key, referencing a person. Even that's dangerous. In reality, SSN is a key attribute of an account with the social security administration. Using it to identify a person is stretching its scope. Using it to identify a pension is just plain wrong.
Walter Mitty
@Walter: I don't think the selection of SSN as a natural key was necessarily right or wrong. It was the decision that was made by the development team, and the system was implemented and met most of the requirements of the users. So, in that sense, it wasn't a _wrong_ choice. The original question requested an illustration; I chose a real-world example I am familiar with. It does illustrate the difficulty in selecting a "natural key". What is the natural key for pension? For payee? And does that key abide by the 6 guiding principles listed in my answer?
spencer7593
The last paragraph is a crucial qualifier that seems to be overlooked in some circles.
Jonathan Leffler
+1  A: 

Before you start designing a table make sure that you really understand what your data is intended to represent. The first rule of database design is "Know thy data." Make sure that you understand the logical data structure inherent in your data. As far as possible keep the physical structure the same as the logical structure. Don't deviate from following the logical data structure until you understand the price that you will pay, because there is always a price to be paid.