views:

2284

answers:

7

I'm just wondering what the optimal solution is here.

Say I have a normalized database. The primary key of the whole system is a varchar. What I'm wondering is should I relate this varchar to an int for normalization or leave it? It's simpler to leave as a varchar, but it might be more optimal

For instance I can have

People
======================
name      varchar(10)   
DoB       DateTime    
Height    int  

Phone_Number
======================
name      varchar(10)   
number    varchar(15)

Or I could have

People
======================
id        int Identity   
name      varchar(10)   
DoB       DateTime  
Height    int  

Phone_Number
======================
id        int   
number    varchar(15)

Add several other one-to-many relationships of course.

What do you all think? Which is better and why?

+6  A: 

Can you really use names as primary keys? Isn't there a high risk of several people with the same name?

If you really are so lucky that your name attribute can be used as primary key, then - by all means - use that. Often, though, you will have to make something up, like a customer_id, etc.

And finally: "NAME" is a reserved word in at least one DBMS, so consider using something else, e.g. fullname.

Troels Arvin
Definitely. It's just a simple example for the theory of it.
theo
@Troels Arvin: +1, include the fact that you're duplicating data all over the place using VARCHAR rather than just a single integer type.
sixlettervariables
+2  A: 

I think if your VARCHAR was larger you would notice you're duplicating quite a bit of data throughout the database. Whereas if you went with a numeric ID column, you're not duplicating nearly the same amount of data when adding foreign key columns to other tables.

Moreover, textual data is a royal pain in terms of comparisons, your life is much easier when you're doing WHERE id = user_id versus WHERE name LIKE inputname (or something similar).

sixlettervariables
Optimization: Don't perform premature optimization. And if you have several identifiers for the same thing (int and varchar), you may end up joining more often.Comparisons: You often have to deal with the name anyways because you are interacting with a real World (customer/sales enters name, etc).
Troels Arvin
Code optimisation is not the same as database optimisation; using identity fields for relationships is best practise and the optimal way of doing things. Don't assign a code paradigm to database design, especially as in this case names are not suitable primary keys.
blowdart
It is _not_ "best practice" to use identity fields. Best practice is that it depends on the situation. By the way: This touches one of the oldest discussions in the database field - a discussion which will probably never end.
Troels Arvin
Regardless of if you consider it a good practice to use text as keys, there are far more ways to get textual keys wrong than integer keys. No refuting that point.
sixlettervariables
+5  A: 

Using any kind of non-synthetic data (i.e. anything from the user, as opposed to generated by the application) as a PK is problematic; you have to worry about culture/localization differences, case sensitivity (and other issues depending on DB collation), can result in data problems if/when that user-entered data ever changes, etc.

Using non-user-generated data (Sequential GUIDs (or non-sequential if your DB doesn't support them or you don't care about page splits) or identity ints (if you don't need GUIDs)) is much easier and much safer.

Regarding duplicate data: I don't see how using non-synthetic keys protects you from that. You still have issues where the user enters "Bob Smith" instead of "Bob K. Smith" or "Smith, Bob" or "bob smith" etc. Duplication management is necessary (and pretty much identical) regardless of whether your key is synthetic or non-synthetic, and non-synthetic keys have a host of other potential issues that synthetic keys neatly avoid.

Many projects don't need to worry about that (tightly constrained collation choices avoid many of them, for example) but in general I prefer synthetic keys. This is not to say you can't be successful with organic keys, clearly you can, but for many projects they're not the better choice.

technophile
Using a PK based on auto-generated values can lead to poor data quality because duplicated information may easier creep into the database, and because you start having data in your database which you can't verify in the real World.
Troels Arvin
@Troels Fortunately we are all entitled to our own opinion but I completely disagree with your statement.
Darrel Miller
@technophile: +1 spot on.
sixlettervariables
+5  A: 

I believe that the majority of people who have developed any significant sized real world database applications will tell you that surrogate keys are the only realistic solution.
I know the academic community will disagree but that is the difference between theoretical purity and practicality.

Any reasonable sized query that has to do joins between tables that use non-surrogate keys where some tables have composite primary keys quickly becomes unmaintainable.

Darrel Miller
@Darrel Miller: +1, the only DB's I see without surrogate keys are those that are in MS Access or some SQL Server DB that came from Access.
sixlettervariables
I don't think academics are really against surrogate keys.
BobbyShaftoe
+1  A: 

If the "name" field really is appropriate as a primary key, then do it. The database will not get more normalized by creating a surrogate key in that case. You will get some duplicate strings for foreign keys, but that is not a normalization issue, since the FK constraint guarantrees integrity on strings just as it would on surrogate keys.

However you are not explaining what the "name" is. In practice it is very seldom that a string is appropriate as a primary key. If it is the name of a person, it wont work as a PK, since more than one person can have the same name, people can change names and so on.

JacquesB
A: 

One thing that others don't seem to have mentioned is that joins on int fields tend to perform better than joins on varchar fields.

And I definitely would always use a surrogate key over using names (of people or businesses) because they are never unique over time. In our database, for instance, we have 164 names with over 100 instances of the same name. This clearly shows the dangers of considering using name as a key field.

HLGEM
Joins perform best when you don't join at all. For example, with organic keys you don't have to join to look up phone numbers by name.
Constantin
You do if you have them in a different table as they normally are since phone numbers are a one-to-many relationship to names.
HLGEM
+1  A: 

The original question is not one of normalization. If you have a normalized database, as you stated, then you do not need to change it for normalization reasons.

There are really two issues in your question. The first is whether ints or varchars a preferable for use as primary keys and foreign keys. The second is whether you can use the natural keys given in the problem definition, or whether you should generate a synthetic key (surrogate key) to take the place of the natural key.

ints are a little more concise than varchars, and a little more efficient for such things as index processing. But the difference is not overwhelming. You should probably not make your decision on this basis alone.

The question of whether the natural key provided really works as a natural key or not is much more significant. The problem of duplicates in a "name" column is not the only problem. There is also the problem of what happens when a person changes her name. This problem probably doesn't surface in the example you've given, but it does surface in lots of other database applications. An example would be the transcript over four years of all the courses taken by a student. A woman might get married and change her name in the course of four years, and now you're stuck.

You either have to leave the name unchanged, in which case it no longer agrees with the real world, or update it retroactively in all the courses the person took, which makes the database disagree with the printed rosters made at the time.

If you do decide on a synthetic key, you now have to decide whether or not the application is going to reveal the value of the synthetic key to the user community. That's another whole can of worms, and beyond the scope of this discussion.

Walter Mitty