views:

195

answers:

8

I know, I quite dislike the catch-all survey type questions, but I couldn't think of a better way to find out what I think I need to know. I'm very green in the world of database development, having only worked on a small number of projects that merely interacted with the database rather than having to actually create a new one from scratch. However, things change and now I am faced with creating my own database.

So far, I have created the tables I need and added the columns that I think I need, including any link tables for many-many relationships and columns for one-to-many relationships. I have some specific questions on this, but I felt that rather than get just these answered, it would make more sense to ask about things I may not even know, which I should address now rather than 6 months from now when we have a populated database and client tools using it.

First the questions on my database which have led me to realise I don't know enough:

  1. How do I ensure my many-to-many link tables and my one-to-many columns are up-to-date when changes are made to the referenced tables? What problems may I encounter?
  2. I am using nvarchar(n) and nvarchar(MAX) for various text fields. Should I use varchar equivalents instead (I had read there may be performance risks in using nvarchar)? Are there any other gotchas regarding the selection of datatypes besides being wary of using fixed length char arrays to hold variable length information? Any rules on how to select the appropriate datatype?
  3. I use int for the ID column of each table, which is my primary key in all but the link tables (where I have two primary keys, the IDs of the referenced table rows). This ID is set as the identity. Are there pitfalls to this approach?
  4. I have created metadata tables for things like unit types and statuses, but I don't know if this was the correct thing to do or not. Should you create new tables for things like enumerated lists or is there a better way?

I understand that databases are complex and the subject of many many worthy tomes, but I suspect many of you have some tips and tricks to augment such reading material (though tips for essential reading would also be welcome).

Community wiki'd due to the rather subjective nature of these kinds of posts. Apologies if this is a duplicate, I've conducted a number of searches for something like this but couldn't find any, though this one is certainly related. Thanks.

Update

I just found this question, which is very similar in a roundabout way.

+3  A: 
  1. Not normalising
  2. Not using normalisation
  3. Trying to implement a denormalised schema from the start

Seriously:

  1. Foreign keys will disallow deletes or updates from the parent tables. Or they can be cascaded.

  2. Small as possible: 2 recent SO questions datatypes and (n)varchar

  3. May not be portable and your "natural key" (say "product name") still needs a unique constraint. Otherwise no, but remember that an IDENTITY column is a "surrogate key"

Edit: Say you expect to store fruit with columns FruitID and FruitName. You have no way to restrict to one occurence of "Apple" or "Orange" because although this is your "natural key", you are using a surrogate key (FruitID). So, to maintain integrity, you need a unique constraint on FruitName

  1. Not sure or your meaning, sorry. Edit: Don't do it. Ye olde "One true lookup table" idea.
gbn
For 4, I have say units for values such as metres, feet, inches. I've created a table to hold this metadata, which is then referenced by another table that describes values (name, units, datatype, etc.).
Jeff Yates
But if I have metadata like Units where there is information about that to store such as the quantity type (length, mass, etc.), scale and offset to and from SI etc., doesn't it make sense to have this data in a table that is then referenced where needed?
Jeff Yates
Perhaps I'm missing the point still, sorry
gbn
No problem, my explanation may be off. Take your Fruit table and imagine I have another table where one of the columns is of type Fruit - do you use a table, like your Fruit table, for the values in that column, or do you have another method?
Jeff Yates
I'd define the column as FruitID with an FK to table Fruit. All my fruit data is in the Fruit table.
gbn
+1  A: 

It does sounds like you've got a good grasp on what you're meant to be doing, and indeed there isn't "one true path" to doing databases.

Have you set up cascades for your hierarchical objects (i.e., a single delete at the 'head' of your object in the database will delete all entries in tables relating to that entry)?

Your link tables and 1:n columns should be foreign keys, so there isn't much to worry about if the data changes. By "two primary keys" here, did you mean indexes?

As for metadata tables, I've done them in the past, and I've not done them. A single char status with SQL comment can suffice for a limited set of statuses, but beyond a certain amount, or where you can think of adding more in the future, you might want to have a reference to another table of metadata, or maybe a char(8ish). E.g., I've seen user tables have "NORMAL", "ADMIN", "SUPER", "GUEST", etc for user type, which could have been 1,2,3,4,5 fkeys to a "UserType" table, but with such a restricted enumeration does it matter? Other people have a table of permissions (booleans of what a user can do) instead - many ways to skin a cat.

JeeBee
I like your point on the statuses. A table for Units makes sense, but a table for status maybe does not (unless it has additional information like a longname for the status for use in UI somewhere).
Jeff Yates
I'd rather leave i18n out of the database (i.e., for your "ADMIN" status, use i18n framework to get "Administrator" in your front-end of choice). At some point a PHB is going to say "let's deploy this in our Spanish office" ;)
JeeBee
Another point I will add to the post in a bit - created and modified columns on tables. You already have an integer primary key, which is good, but it can be nice to be able to order data by creation or modification time.
JeeBee
That's an excellent point - I guess coming from a software background, I'm very wary of "magic numbers" that aren't enforced with a rigid meaning - just because a comment says 1 will be ADMIN, doesn't mean people remember to read that comment, etc. I think I need to adjust my thinking.
Jeff Yates
As to your second comment, yes, I can see the idea behind additional columns for created and modified dates, I'm just worried about the heck I keep all this up-to-date! :D
Jeff Yates
For created, that's just get_date() function in your insert SQL. For modified - set up a trigger on row update, or also put it in your SQL for update. And you can set up a foreign key on a char() entry, so you can have "ADMIN" in your User table, and "ADMIN", "Administrator", x, y, z in your UserType table.
JeeBee
+2  A: 

I'll reply to your subjective query with some vague generalities. :)

The most common pitfall of designing a database is the same pitfall of any programming solution, not fully understanding the problem being solved. In the case of a database, it is understanding the nature of the data. How big it is, how it comes and goes, what business rules must it adhere to.

Here are some questions to ponder.

What is updated the most frequently? Is keeping that table write-locked going to lock up queries? Will it become a hot spot? Even a seemingly well normalized schema can be a poor performer if you don't understand your read versus write ratios.

What are your external interface needs? I've been on projects where the dotted line to "that other system" nearly scuttled the whole project because implementing it was delayed until everything else was in place, that is to say, everything else was inflexible.

Any other unspoken requirements? My favorite is date sensitivity. All the data is there, your reports are beautiful, the boss looks them over and asks, when did that datum change? Who did it and when? Is the database supposed to track itself and its users, or just the data? Will your front end do it for you?

Just some things to think about.

kmarsh
Thanks, those are some great questions. I hadn't considered the loading of tables (as in, hot spots). How does one ensure a table is write-locked? Does it require special stored procedures or can it be done as an attribute of the table?
Jeff Yates
A properly installed ACID compliant RDBMS should write lock automatically as necessary on any insert, update or commit. Realizing that it does lock is half the battle. If every single commit, of any type, locks the same table/row/column, you know you have a hotspot. Other hotspots may not be as obvious until you are in production. Thinking about this aspect from the beginning goes a long way in identifying and avoiding hotspots.
kmarsh
A: 

I'd suggest a good book. The best IMO is this:

http://www.amazon.com/Server-2005-Database-Design-Optimization/dp/1590595297/ref=ntt_at_ep_dpt_1

AlexKuznetsov
A: 

In addition to not normalizing, a common problem I see is overindexing, done before there are performance measurements that take into account your in-production mix of reads vs. writes.

It's really, really easy to add an index to speed up a query, and harder to figure out which one to remove when you have several that are getting updated during an INSERT or UPDATE.

The middle ground is to go after obvious secondary indexes (e.g., for common, frequent lookups by name on large tables), deferring other candidate indexes until you have reasonable performance tests in place.

Dave W. Smith
Is an index the same as choosing primary keys?
Jeff Yates
Not necessarily. You can specify non-unique secondary indexes.
Dave W. Smith
A: 

You might find some usable stuff in these slides: [http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back][1]

jandersson
+1  A: 

I also am a beginner to database design, but I found this online tutorial very, very helpful:

Database design with UML and SQL, 3rd edition

The author explains all the fundamental design aspects of database, and in a very clear manner. Before I found this online guide I did a lot of wikipedia reading about normalization. While that helped, this author explains the exact same stuff (through 3rd normal form, at least) but in a much, much easier to read way. It pretty much addresses all your questions as well.

JoeCool
Thanks, that's great!
Jeff Yates
A: 

Among other things, not using primary keys, not thinking ahead about whether you'll be using indexed views (and designing tables accordingly; I once had to drop and recreate a large table at my site to change its ANSI_NULL attribute to ON so that I could then use it with an indexed view), using indices.

Darth Continent