views:

943

answers:

23

What is important to keep in mind when designing a database?

I don't want to limit your answer to my needs, as I am sure that others can benefit from your insights as well. But I am planning a content management system for a multi-client, community-driven site.

+20  A: 

(Assuming OLTP)

Normalisation of your data structures. (Performance de-normalisation can generally follow later where needed.)

http://en.wikipedia.org/wiki/Database_normalization
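
A minimal sketch of the idea (the table and column names here are invented for illustration): a repeating attribute moves out of the parent table into its own child table.

    -- Unnormalised: repeating phone columns on the customer row
    --   Customer (CustomerId, Name, Phone1, Phone2, Phone3)
    -- Normalised: the repeating attribute becomes a child table
    CREATE TABLE Customer (
        CustomerId INT          PRIMARY KEY,
        Name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE CustomerPhone (
        CustomerId INT         NOT NULL REFERENCES Customer (CustomerId),
        Phone      VARCHAR(20) NOT NULL,
        PRIMARY KEY (CustomerId, Phone)
    );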

cagcowboy
+26  A: 

"Normalize till it hurts; de-normalize till it works."

Keng
Never heard that before... nice!
cagcowboy
Excellent way to put it!
Abyss Knight
This is going on my wall
Raj More
Who said it? Good answer anyway, +1.
fastcodejava
+7  A: 

Try to imagine the SQL queries that you will perform against it.

This is important because you will do it A LOT!

Daniel Silveira
+1  A: 

If you have queries that you're going to be running A LOT, make them into stored procedures. They will almost always run faster.
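
As a rough illustration only (T-SQL syntax; the Orders table, its columns, and the procedure name are all hypothetical), wrapping a frequent lookup in a procedure looks roughly like this:

    -- A frequently-run lookup wrapped in a stored procedure
    CREATE PROCEDURE GetOrdersForCustomer
        @CustomerId INT
    AS
    BEGIN
        SELECT OrderId, OrderDate, Total
        FROM   Orders
        WHERE  CustomerId = @CustomerId;
    END

Whether this actually runs faster depends on the DBMS and its plan caching, as the comments below point out.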

Keng
Any evidence of this? I think it's not true and I haven't found any solid benchmarks.
S.Lott
This is a good practice whether it makes the query run faster or not.
Leigh Riffel
+1  A: 

If you'll be looking rows up by fields other than the primary key, make sure to index them.
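
For example (hypothetical table and column; exact syntax varies slightly by DBMS):

    -- Users are usually looked up by email rather than by their key,
    -- so index the lookup column
    CREATE INDEX IX_Users_Email ON Users (Email);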

R. Bemrose
Good tip but not really a design issue IMO
Joe Philllips
+13  A: 

Make sure you use constraints (CHECK, NOT NULL, FOREIGN KEY, PRIMARY KEY, and DEFAULT) to ensure that only correct data is stored in the database in the first place. You can always buy faster hardware but you cannot buy more correct data.
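
A small sketch of declaring those rules in the schema itself (the table and its rules are made up for the example):

    CREATE TABLE Account (
        AccountId  INT           NOT NULL PRIMARY KEY,                          -- PRIMARY KEY
        CustomerId INT           NOT NULL REFERENCES Customer (CustomerId),     -- FOREIGN KEY
        Balance    DECIMAL(12,2) NOT NULL DEFAULT 0,                            -- NOT NULL, DEFAULT
        Status     VARCHAR(10)   NOT NULL CHECK (Status IN ('OPEN', 'CLOSED'))  -- CHECK
    );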

jalbert
A: 

Don't use a large set of columns as primary keys.

Daniel Silveira
You'll need to have every one of those columns in any table that references the one with the multi-column primary key.
R. Bemrose
That's not necessarily true, as you can always use alternate keys for your foreign key (at least in MS SQL). It's also not a bad thing to be carrying those columns around to tables that have FKs into the main table. The extra space is usually unimportant compared to other advantages that you gain.
Tom H.
In the program, it is kind of tedious to have to carry all these columns around in variables in your code.
Daniel Silveira
+1  A: 

Are you using an object-oriented language? If so, try modelling your objects before the database. This will help you focus on the model.

Daniel Silveira
+1  A: 

Understand the requirements as much as you possibly can up front. Then design a logical schema that will only have to change if the requirements change, or if you migrate to a completely different kind of database, like one that doesn't use SQL. Then refine and extend your design into a physical design that takes into account your particular DBMS product, your volume, your load, and your speed requirements.

Learn how to normalise, but also learn when to break the normalization rules.

Walter Mitty
+6  A: 

Some things I would keep in mind. Make sure every table has a way to uniquely identify records (you will save untold hours of pain doing this). Normalize but do not join on large multi-column natural keys unless you want the whole thing to be slow. Use a numeric key that is autogenerated in the parent table instead.

Yes, think about the kinds of queries and reports you will need to run. Think about extensibility. It may seem like you won't need more than 10 product columns in the order table, but what happens when you need 11? Better to have an order table and an order detail table.
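
A rough sketch of that header/detail split (names and columns are illustrative only):

    -- Instead of Product01..Product10 columns on the order row,
    -- each product becomes its own detail row
    CREATE TABLE Orders (
        OrderId    INT  NOT NULL PRIMARY KEY,
        CustomerId INT  NOT NULL,
        OrderDate  DATE NOT NULL
    );

    CREATE TABLE OrderDetail (
        OrderId    INT NOT NULL REFERENCES Orders (OrderId),
        LineNumber INT NOT NULL,
        ProductId  INT NOT NULL,
        Quantity   INT NOT NULL CHECK (Quantity > 0),
        PRIMARY KEY (OrderId, LineNumber)
    );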

Make sure all data integrity rules are incorporated into the database. Not all data changes happen from the user interface and I've had to try to fix too many badly messed up databases because the designers figured it was OK to put all rules in the GUI.

The most critical things to consider when designing are first how to ensure data integrity (if the data is meaningless then the database is useless) and second how to ensure performance. Do not use an object model to design a relational database unless you want bad performance.

The next most important thing is data protection and security. Users should never have direct access to the database tables. If your design requires dynamic SQL they will have to have that access. This is bad from the perspective of potential hacking in through things like SQL injection attacks, but even more importantly, it opens up your database for internal people to commit fraud. Are there fields where you need to encrypt the data (credit card information, passwords, and Social Security numbers are among the items that should never be stored unencrypted)? How do you plan to do that, and how do you plan to audit decryption to ensure people are not decrypting when they have no need to see the data? Are there legal hoops you must go through (HIPAA and Sarbanes-Oxley spring to mind)?

HLGEM
+12  A: 

Establish consistent naming standards up-front. It will save several minutes of unnecessary thinking in the long run. (This may read as irony, but I am serious.)

And don't abbreviate anything, unless it is extremely common. Don't turn the database into a license-plate message guessing game. It's amazing what becomes not-obvious after a year.

Jeffrey L Whitledge
A: 

Remember that normalisation is only relative to what you are modelling. Perhaps you are modelling a collection of objects in your domain. Maybe you are recording a series of events, in which data are repeated because the same data happen to apply at more than one time. Don't mix up the two things.

Marcin
A: 

I agree that knowing about your data and normalizing are good.

Something else I would suggest is to keep very large text fields in a separate table. For example, if you have a contract you might want to keep a lot of the information about the contract in one table but keep the legal (and very large) document in a separate table. Just put a reference from the main table to the legal document table.
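
Roughly like this (hypothetical tables; the large-text column type is whatever your DBMS provides, e.g. TEXT, CLOB, or VARCHAR(MAX)):

    -- Contract metadata stays narrow; the large document lives in its own table
    CREATE TABLE Contract (
        ContractId INT  NOT NULL PRIMARY KEY,
        ClientId   INT  NOT NULL,
        SignedDate DATE NOT NULL
    );

    CREATE TABLE ContractDocument (
        ContractId INT  NOT NULL PRIMARY KEY REFERENCES Contract (ContractId),
        LegalText  TEXT NOT NULL
    );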

Ben
Why? I mean, don't tell me that you are doing SELECT * FROM Table :)
Luk
This advice seems implementation-dependent (on the DB engine end). I don't think it's relevant on PostgreSQL or Oracle, but don't know. There *is* a SO question somewhere...
voyager
+1  A: 

I strongly echo that normalization is critical, with tactical de-normalization to follow for performance or other maintainability reasons. However, if you're expecting to have more than just a few tables, I'd like to offer one caveat about normalization that will make your life a lot easier as the number of tables grows.

The caveat is to make the primary key for each table a single numeric column (appropriate for your flavor of DB). In academic normalization, the idea is to combine whatever attributes (columns) of an entity (table) so that you can uniquely identify an instance of what is being described (row), and you can end up with a multi-column composite primary key. So then whenever you migrate that composite key as a foreign key to other tables, you end up duplicating those multiple columns in every table that references it. That might work for you if you only have half a dozen tables. But it falls apart quickly when you go much bigger than that.

So instead of a multi-column composite primary key, go with a sequential numeric primary key even though that approach goes against some of the strict normalization teachings.
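
A sketch of the difference (SQL Server IDENTITY syntax shown; other DBMSs use sequences or AUTO_INCREMENT, and all names are invented for the example):

    -- The natural key stays as an alternate (UNIQUE) key,
    -- but child tables carry a single-column surrogate instead
    CREATE TABLE Orders (
        OrderId INT IDENTITY(1,1) PRIMARY KEY,
        Region  CHAR(3) NOT NULL,
        Branch  INT     NOT NULL,
        OrderNo INT     NOT NULL,
        UNIQUE (Region, Branch, OrderNo)
    );

    CREATE TABLE OrderLine (
        OrderId    INT NOT NULL REFERENCES Orders (OrderId),  -- one column, not three
        LineNumber INT NOT NULL,
        PRIMARY KEY (OrderId, LineNumber)
    );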

Ed Lucas
+3  A: 

Get a really good book on data modeling - one written by a true database developer, not a .NET developer who tries to teach you how it's done in the "real world".

The problem space of database design is simply way too large to be significantly covered in a forum like this. Despite that though, I'll give you a few personal pointers:

Listen to the above posts about normalization. NEVER denormalize because you THINK that you have to for performance reasons. You should only denormalize after you've experienced actual performance issues (ideally in your QA environment, not production). Even then, consider that there may be a better way to write your queries or improve indexing first.

Constrain the data as much as possible. Columns should be NOT NULL as much as possible. Use CHECK constraints and FOREIGN KEYs wherever they should be. If you don't do this, bad data will get into your database and cause a lot of headaches and special case programming.

Think through your data before you actually start designing tables. Get a good handle on how your processes will flow and what data they will need to track. Oftentimes what you think is an entity at first glance turns out to be two entities. As an example, in a system that I'm working on, the previous designer created a Member table, and all of the information from their application was part of the Member table. It turns out that a Member might want to change data that was on their application, but we still need to track what the original application looked like, so the Application is really its own entity and the Member is an entity that might initially be populated from the Application. In short, do extensive data analysis; don't just start creating tables.
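
A minimal sketch of that Member/Application split (columns are invented for illustration):

    -- The original application is preserved as its own entity;
    -- the member row can change independently of it
    CREATE TABLE Application (
        ApplicationId INT          NOT NULL PRIMARY KEY,
        SubmittedDate DATE         NOT NULL,
        Email         VARCHAR(255) NOT NULL
    );

    CREATE TABLE Member (
        MemberId      INT          NOT NULL PRIMARY KEY,
        ApplicationId INT          NOT NULL REFERENCES Application (ApplicationId),
        Email         VARCHAR(255) NOT NULL  -- may later diverge from the application
    );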

Tom H.
I love your point about 'NEVER denormalize because you THINK that you have to for performance reasons.'
Raj More
+2  A: 

Since there have been several posts advocating this now, I'll add one more thing...

DON'T fall into the trap of putting ID columns on all of your tables. There are many VERY good reasons why modern database design theory uses real primary keys and they aren't strictly academic reasons. I've worked with databases that included hundreds of tables, many of which were multi-million row tables, with over 1000 concurrent users and using real primary keys did not "break down".

Using ID columns on all of your tables means that you will have to do multi-table joins to traverse across the database, which becomes a big hassle. It also tends to promote sloppy database design and even beyond that often results in problems with duplicate rows. Another issue is that when dealing with outside systems you now have to communicate these IDs around.

There are places for surrogate IDs - type code tables and conceptual tables (for example, a table of system rules could use an ID if the rules don't have real-world identifiers). Using them everywhere is a mistake IMO.

It's a long-standing debate, but that's my opinion on the matter, for what it's worth.
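
To illustrate the traversal point (purely a hypothetical schema, not a recommendation either way): with surrogate IDs everywhere, finding an order's country can mean walking Order -> Customer -> Address -> City -> Country, whereas a carried natural key makes the lookup direct.

    -- If the order row carries the natural key (CountryCode),
    -- no intermediate joins are needed
    SELECT o.OrderId, c.CountryName
    FROM   Orders  o
    JOIN   Country c ON c.CountryCode = o.CountryCode;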

Tom H.
It's not a debate. Surrogate keys work to make the data maintainable. "Natural Keys" or "Real Keys" impose weird restrictions that sometimes mirror the real world, but other times mirror the relational model.
S.Lott
It's only "not a debate" if you're so arrogant to think that your opinion is the only valid one. As I said, I've worked with databases that measured in the hundreds of tables and the data was easily "maintainable". Any restrictions imposed are actual restrictions, or your model is the problem.
Tom H.
The issue isn't "can a smart person manage it". Clearly, you're smart and you can manage it. The issue is that failing to put immutable surrogate keys in leads to odd, quirky restrictions on what the database can and can't do.
S.Lott
I like to do both. I have auto-generated surrogate keys and uniquely constrained "natural" keys for almost every non-junction table.
Gilligan
A: 

I'd say an important thing to keep in mind is that the structure may change. So don't design yourself into a corner. Make sure whatever you do leaves you some "room" and even an avenue to migrate the data into a different structure some day.

arin sarkissian
+3  A: 

Data Is Eternal. Processing Comes and Goes.

Get the relational model to be a high-fidelity representation of the real world. This matters more than anything else.

Processing will change and evolve for years. But your data -- and the data model -- can't evolve at the same pace and with the same flexibility. You can add processing, but you can't magically add information. You don't want to delete information (but you can ignore it.)

Get the model right. The entities and relationships in your diagrams should make rational sense to a casual non-technical user. Even the application programming should be simple, clear and precise.

If you're struggling with the model, don't invent big, complex queries or (worse) stored procedures to work around the problems. Procedural work-arounds are a costly mistake. Understand what you have, what you want to do, and apply the YAGNI principle to pare things down to the essentials.

S.Lott
+1  A: 

Make sure that as much metadata as possible is encoded in the model. It should be possible to infer almost any business rule or concept from just looking at the data model.

This means, take care to pick names that reflect the reality of the users (but don't be afraid to change their perception of reality if it helps the model).

Encode all constraints you can in the database. Don't rely on the application layer to only supply sensible data. Make sure that only sensible data can exist in the first place.

Don't do aggregate data in the model. Keep the model as atomic as possible. Either aggregate on the fly or run regular aggregation jobs into aggregate tables.
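
For instance (hypothetical Payment and summary tables):

    -- Keep the atomic rows; compute totals when needed...
    SELECT CustomerId, SUM(Amount) AS TotalSpent
    FROM   Payment
    GROUP  BY CustomerId;

    -- ...or refresh a separate aggregate table on a schedule
    INSERT INTO CustomerSpendSummary (CustomerId, TotalSpent, RefreshedAt)
    SELECT CustomerId, SUM(Amount), CURRENT_TIMESTAMP
    FROM   Payment
    GROUP  BY CustomerId;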

Pick a good partition between schemas. Some partitioning makes sense to do with foreign keys, and some by pure physical separation.

John Nilsson
+2  A: 

I know this has been stated, but normalization, normalization, normalization is the key. If there is an instance where you feel that, for whatever reason, you need to store data in a non-normalized format, don't do it. This should be handled through views or in a separate reporting database. My other key advice is to avoid text/ntext fields wherever possible.
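
For example, a flattened shape for reporting can be exposed as a view while the base tables stay normalized (names are illustrative only):

    CREATE VIEW OrderSummary AS
    SELECT o.OrderId, o.OrderDate, c.Name AS CustomerName, a.City
    FROM   Orders o
    JOIN   Customer        c ON c.CustomerId = o.CustomerId
    JOIN   CustomerAddress a ON a.CustomerId = c.CustomerId;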

CNote
+2  A: 

"Thumb rule of Databases - Down Always Beats Across!"

Examples: If you have a Customer table with columns for mailing address, shipping address, and billing address, create a separate CustomerAddress table with an address type.

If you have a CancellationDetails table with CancellationReason01, CancellationReason02, CancellationReason03, create a separate CancellationReason table.
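
A sketch of the first example (column names are made up; the address-type list is just for illustration):

    -- "Down, not across": one address row per type instead of three column groups
    CREATE TABLE CustomerAddress (
        CustomerId  INT          NOT NULL REFERENCES Customer (CustomerId),
        AddressType VARCHAR(10)  NOT NULL CHECK (AddressType IN ('MAILING', 'SHIPPING', 'BILLING')),
        Street      VARCHAR(100) NOT NULL,
        City        VARCHAR(50)  NOT NULL,
        PRIMARY KEY (CustomerId, AddressType)
    );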

Raj More
If you have a CancellationDetails table with CancellationReason01, CancellationReason02, CancellationReason03... you have a **horrible, horrible mess**. And we've got some of those in our system.
voyager
+2  A: 

Be practical. Keep in mind what your goals are and don't go crazy creating unnecessary complexity. I have some preferences:

  • Keep the number of tables small.
  • Prefer narrow tables over wide ones full of null values.
  • Normalization is generally good.
  • Triggers are typically very painful.

But these are a means to an end (they are contradictory in many cases and require careful balancing); the main thing is to let the requirements drive the design. Your choice of what is a separate entity, what is part of another entity, and what is cat food (not anything whose identity you care about) depends entirely on your requirements.

Nathan Hughes
I don't necessarily disagree with you, but I note with amusement that you recommend both few tables and small tables. This can be summarized as "store little data". I guess that's a pretty good recommendation, since it helps avoid redundancy. :-)
Jeffrey L Whitledge
@Jeffrey: You're right, there is definitely a balancing act involved. Things like: more joins, or wider tables?
Nathan Hughes
A: 

As much as you can, make the primary key a sequence-generated number.
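
The exact syntax is dialect-dependent; a rough sketch (table and column names invented):

    CREATE TABLE Widget (
        WidgetId INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- ANSI / PostgreSQL / Oracle 12c+
        Name     VARCHAR(100) NOT NULL
    );
    -- SQL Server: WidgetId INT IDENTITY(1,1) PRIMARY KEY
    -- MySQL:      WidgetId INT AUTO_INCREMENT PRIMARY KEY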

fastcodejava