views:

3307

answers:

18

When designing tables, I've developed a habit of having one column that is unique and that I make the primary key. This is achieved in four ways depending on requirements:

  1. Identity integer column that auto increments.
  2. Unique identifier (GUID)
  3. A short character(x) or integer (or other relatively small numeric type) column that can serve as a row identifier column

Number 4 would be used for fairly small lookup, mostly read tables that might have a unique static length string code, or a numeric value such as a year or other number.

For the most part, all other tables will either have an auto-incrementing integer or unique identifier primary key.

The Question :-)

I have recently started working with databases that have no consistent row identifier and primary keys are currently clustered across various columns. Some examples:

  • datetime/character
  • datetime/integer
  • datetime/varchar
  • char/nvarchar/nvarchar

Is there a valid case for this? I would have always defined an identity or unique identifier column for these cases.

In addition there are many tables without primary keys at all. What are the valid reasons, if any, for this?

I'm trying to understand why tables were designed as they were, and it appears to be a big mess to me, but maybe there were good reasons for it.

EDIT

OK... Wow! A lot of great responses and discussion. I guess I hit on a topic that's a little bit religious without realizing it. :-)

A third question to sort of help me decipher the answers: In cases where multiple columns are used to comprise the compound primary key, is there a specific advantage to this method vs. a surrogate/artificial key? I'm thinking mostly in regards to performance, maintenance, administration, etc.?

EDIT 2

There are lots of good answers here, and it was hard to choose the "best" one, so I've chosen one I thought was helpful, but didn't receive as many votes, and up-voted the others that helped answer my question.

+3  A: 

I suspect Steven A. Lowe's rolled up newspaper therapy is required for the designer of the original data structure.

As an aside, guids as a primary key can be a performance hog. I wouldn't recommend it.

Andrew Rollings
To say its a performance hog is a premature optimization. Guids are required in some cases (disconnected clients, future table merging, replication)
JC
"Premature optimization" is an overused phrase on SO (IMHO)! Yes, GUIDs may be required in SOME cases, but Andrew is right to point out that they shouldn't be used as the default data type whether required or not.
Tony Andrews
If I buy my new car in summer; and I also buy a set of winter tires immediately, because experience tells me that it will snow in 6 months, is that premature optimisation, or planning ahead?
Treb
It's premature optimization. Unless there's a sale on snow tires, you would have been better off investing the money, and buying snow tires only when you actually needed them.
Kibbee
Or it is planning ahead, because I was at the car dealer anyway, and saved myself a second trip.
Treb
OK, it wasn't actually a premature optimization. What I meant is that most people don't experience the volume required to notice the performance difference. Yes, use autoincrement if you know you will never need a guid.
JC
Or use both. Have an int/long based primary key for nice quick selects and joins, and then have a guid field.At least, that's what I'm doing. Is this wrong? Should I not be doing that? :)
Andrew Rollings
@Treb: if you live in Florida, it's a complete waste of money. If you live in Anchorage, the car already comes with snow tires.
Steven A. Lowe
+7  A: 

Tables should have a primary key all the time. When it doesn't it should have been an AutoIncrement fields.

Sometime people omit primary key because they transfer a lot of data and it might slow down (depend of the database) the process. BUT, it should be added after it.

Some one comment about link table, this is right, it's an exception BUT fields should be FK to keep the integrity, and is some case those fields can be primary keys too if duplicate in links is not authorized... but to keep in a simple form because exception is something often in programming, primary key should be present to keep the integrity of your data.

Daok
I agree. And in the case where a lot of data is to be inserted, remove the primary key constraint (or use INSERT IDENTITY ON in TSQL) and put it back afterwards :)
Andrew Rollings
There are exceptions: link tables obviously
annakata
You are right annakata, I made an edit! Thx
Daok
Another reason: If there is no PK/unique key, table browsers (I mean, something like Access / SQL Server Management Studio) will refused to update/delete a single row with duplicated row. You'll have to write SQL for that.
Dennis Cheung
It's quite common to omit a PK from a data warehouse fact table. In Oracle you can reference the ROWID pseudocolumn as a unique identifier in the short term (ie. don't store it somewhere and expect it to not change)
David Aldridge
+5  A: 

A natural key, if available, is usually best. So, if datetime/char uniquely identifies the row and both parts are meaningful to the row, that's great.

If just the datetime is meaningful, and the char is just tacked on to make it unique, then you might as well just go with an identify field.

James Curran
Usually best? I don't have any scientific basis but I'm almost positive most people prefer a surrogate key over natural. In many cases there is no natural key.
JC
Just because most people like something doesn't mean it is better...
Tony Andrews
There should ALWAYS be a natural key for any row in your database. That "natural" key may be something generated in the business world or by your technical system, but it should always exist.
Tom H.
So you're calling a guid PK column a natural key?
JC
If, in your world, that's what has been determined to be the only way to identify a row in the table, then yes. Of course, when a designer chooses to create a GUID for a PK it's usually because they haven't done the work to find the REAL natural key, so in that case the GUID is NOT the natural key.
Tom H.
1. Asking the database to use string (etc) comparisons when doing indexing or joins doesn't feel right - ints are much more suitable for that.
Barry Fandango
2. If you take your key from the natural world, the natural world will change to break your key. If you use the telephone number, you'll get two users from the same household. If you use the last name, they get married. If you use SSN, privacy laws will change and require you remove them.
Barry Fandango
@Barry: RE: #2. if the natural world changes and that causes your natural key to change that means that you did a poor job selecting a natural key. By definition, a natural key does not change over time.
Tom H.
I should amend that... a natural key should be VERY VERY unlikely to change over time. Using a last name as part of a natural key is a bad decision on the designer's part and not a strike against natural keys.
Tom H.
+2  A: 

You should use a 'composite' or 'compound' primary key that comprises of multiple fields.

This is a perfectly acceptable solution, go here for more info :)

adam
+3  A: 

I too always use a numeric ID column. In oracle I use number(18,0) for no real reason above number(12,0) (or whatever is an int rather than a long), maybe I just don't want to ever worry about getting a few billion rows in the db!

I also include a created and modified column (type timestamp) for basic tracking, where it seems useful.

I don't mind setting up unique constraints on other combinations of columns, but I really like my id, created, modified baseline requirements.

JeeBee
I must also point out that I don't put IDs on link/join tables, only on the tables containing data.
JeeBee
A: 

We do a lot of joins and composite primary keys have just become a performance hog. A simple int or long takes care of many problems even though you are introducing a second candidate key, but it's a lot easier and more understandable to join on one field versus three.

Dan Blair
This strategy falls apart when you now have to traverse 6 tables to join the actual two tables that you need because composite keys were not propagated. It also ends up requiring the use of loops/cursors for multiple inserts which can be a HUGE performance hog.
Tom H.
I'm not to big to learn something new. I'd love to see an example of what you're saying, it would be helpful to inject a little rational fact into some of these religious arguments.
Dan Blair
+1  A: 

I always use an autonumber or identity field.

I worked for a client who had used SSN as a primary key and then because of HIPAA regulations was forced to change to a "MemberID" and it caused a ton of problems when updating the foreign keys in related tables. Sticking to a consistent standard of an identity column has helped me avoid a similar problem in all of my projects.

Matt
Poor selection of a natural key by a developer doesn't mean that natural keys are bad.
Tom H.
Can I upvote a comment?
Kibbee
+9  A: 

There´s no problem in making your primary key from various fields, that's a Natural Key.

You can use a Identity column (associated with a unique index on the candidate fields) to make a Surrogate Key.

That´s an old discussion. I prefer surrogate keys in most situations.

But there´s no excuse for the lack of a key.

RE: EDIT

Yeah, there´s a lot of controversy about that :D

I don´t see any obvious advantage on natural keys, besides the fact that they are the natural choice. You will always think in Name, SocialNumber - or something like that - instead of idPerson.

Surrogate keys are the answer to some of the problems that natural keys have (propagating changes for example).

As you get used to surrogates, it seems more clean, and manageable.

But in the end, you´ll find out that it's just a matter of taste - or mindset -. People "think better" with natural keys, and others don´t.

DonOctavioDelFlores
People "think better" with natural keys. Machines and databases, don't.
F.D.Castel
Yeah, but who designs them? That´s the point.
DonOctavioDelFlores
+16  A: 

Natural verses artifical keys is a kind of religious debate among the database community - see this article and others it links to. I'm neither in favour of always having artifical keys, nor of never having them. I would decide on a case-by-case basis, for example:

  • US States: I'd go for state_code ('TX' for Texas etc.), rather than state_id=1 for Texas
  • Employees: I'd usually create an artifical employee_id, because it's hard to find anything else that works. SSN or equivalent may work, but there could be issues like a new joiner who hasn't supplied his/her SSN yet.
  • Employee Salary History: (employee_id, start_date). I would not create an artifical employee_salary_history_id. What point would it serve (other than "foolish consistency")

Wherever artificial keys are used, you should always also declare unique constraints on the natural keys. For example, use state_id if you must, but then you'd better declare a unique constraint on state_code, otherwise you are sure to eventually end up with:

state_id    state_code   state_name
137         TX           Texas
...         ...          ...
249         TX           Texas
Tony Andrews
Both of the links in this post are (now?) bad.
Jon
@Jon. Thanks, now fixed. They were good at the time of posting!
Tony Andrews
+1  A: 

All tables should have a Primary key otherwise what you have is a HEAP - this , in some situations , might be what you want (Heavy insert load when the data is then replicated via Service broker to another database or table for instance).

Lookup tables with a low volume of rows you can use a 3 CHAR code as the Primary Key as this takes less room than an INT but the performance difference is neglibable. Other than that I would always use an INT unless you have a reference table that perhaps has a composite primary key made up foreign keys from associated tables.

Coolcoder
+1  A: 

If you really want to read through all of the back and forth on this age-old debate, do a search for "Natural Key" on StackedOverflow. You should get back pages of results.

Tom H.
+21  A: 

I follow a couple of rules:

  1. Primary keys should be as small as necessary. Prefer a numeric type because numeric types are stored in a much more compact format than character formats. This is because most primary keys will be foreign keys in another table as well as used in multiple indexes. The smaller your key, the smaller the index, the less pages in the cache you will use.
  2. Primary keys should never change. Updating a primary key should always be out of the question. This is because it is most likely to be used in multiple indexes and used as a foreign key. Updating a single primary key could cause of ripple effect of changes.

On surrogate vs natural key I refer to the rules above. If the natural key is small and will never change it can be used as a primary key. If the natural key is large or likely to change I use surrogate keys. If there is no primary key I still make a surrogate key because experience shows you will always add tables to your schema and wish you'd put a pk in place.

Logicalmind
I like it! Do you have any documentation for the basis of your "rules"? Thanks!
Lloyd Cotten
No, just experience. When dealing with "small" databases this stuff doesn't matter so much. But when you deal with large db's all of the little things matter. Just imagine if you have 1 billion rows with int or long pk's compared to using text or guid's. There's a huge difference!
Logicalmind
Just remember to put that unique index on the natural key (if one actually exists which is often not the case) when you use an artifical key.
HLGEM
+2  A: 

Natural verses artificial keys to me is a matter of how much of the business logic you want in your database. Social security number is a great example

"Each client in my database will, and must, have an SSN." Bam, done, make it the primary key and be done with it. Just remember when your business rules change you're burned.

I don't like natural keys myself, due to my experience with changing business rules. But if your sure it wont change, it might prevent a few critical joins.

Dan Williams
And I have seen data where SSN is not unique even though it should be. Be very wary of natural keys if you import your data from another source!
HLGEM
+2  A: 

I look for natural primary keys and use them where I can.

If no natural keys can be found, I prefer a GUID to a INT++ because SQL Server use trees, and it is bad to always add keys to the end in trees.

On tables that are many-to-many couplings I use a compound primary key of the foreign keys.

Because I'm lucky enough to use SQL Server I can study execution plans and statistics with the profiler and the query analyzer and find out how my keys are performing very easily.

Guge
Do you have any documentation to back up this statement: 'if no natural keys can be found, I prefer a GUID to a INT++ because SQL Server use trees, and it is bad to always add keys to the end in trees.' Not skeptical, just trying to compile some documentation.
Lloyd Cotten
@Lloyd - Glad you're taking an interest into something I find very fascinating myself. A good starting point at http://msdn.microsoft.com/en-us/library/ms177443(SQL.90).aspx
Guge
A: 

Guids can be used as a primary key, but you need to create the right type of Guid so that it performes well.

You need to generate COMB guids. Here is a good article about it and performance stats. http://www.informit.com/articles/article.aspx?p=25862&seqNum=7

Also here is some code on building COMB Guids in sql. http://dotnetslackers.com/Community/forums/uniqueidentifier-vs-identity/t/808.aspx?PageIndex=2

Donny V.
IMHO, guid's should only be used when you need to synchronize data across databases. In which an automatically generated id is problematic. The difference between using a guid and using a basic numeric type is that a guid will require 16 bytes per row, while a numeric will be much smaller.
Logicalmind
If you go to the link I provided above there is very little difference in performance using COMB Guids.
Donny V.
+5  A: 

Just an extra comment that is often overlooked. Sometimes not using a surrogate key has benefits in the child tables. Let's say we have a design that allows you to run multiple companies within the one database (maybe it's a hosted solution, or whatever).

Let's say we have these tables and columns:

Company:
  CompanyId   (pk)

CostCenter:
  CompanyId   (pk, fk to Company)
  CostCentre  (pk)

CostElement
  CompanyId   (pk, fk to Company)
  CostElement (pk)

Invoice:
  InvoiceId    (pk)
  CompanyId    (pk, in fk to CostCentre, in fk to CostElement)
  CostCentre   (in fk to CostCentre)
  CostElement  (in fk to CostElement)

Incase that last bit doesn't make sense, Invoice.CompanyId is part of two foreign keys, one to the CostCentre table and one to the CostElement table. The primary key is (InvoiceId, CompanyId).

In this model, it's not possible to screw-up and reference a CostElement from one company and a CostCentre from another company. If a surrogate key was used on the CostElement and CostCentre tables, it would be.

Less chances to screw up the better.

WW
This is an under-cited disadvantage when using surrogate keys. If the table has a surrogate key I can still use it for these kinds of constraints. Unfortunately though the constraint requires an index and it's just weird to create a unique index on (surrogate_key, other_column) when (surrogate_key) is unique by itself. Also, (other_column) is often totally redundant in a map table since (surrogate_key) is unique in the foreign one. Surrogates can really muck things up.
Samuel Danielson
A: 

I'll be up-front about my preference for natural keys - use them where possible, as they'll make your life of database administration a lot easier. I established a standard in our company that all tables have the following columns:

  • Row ID (GUID)
  • Creator (string; has a default of the current user's name (SUSER_SNAME() in T-SQL))
  • Created (DateTime)
  • Timestamp

Row ID has a unique key on it per table, and in any case is auto-generated per row (and permissions prevent anyone editing it), and is reasonably guaranteed to be unique across all tables and databases. If any ORM systems need a single ID key, this is the one to use.

Meanwhile, the actual PK is, if possible, a natural key. My internal rules are something like:

  • People - use surrogate key, e.g. INT. If it's internal, the Active Directory user GUID is an acceptable choice
  • Lookup tables (e.g. StatusCodes) - use a short CHAR code; it's easier to remember than INTs, and in many cases the paper forms and users will also use it for brevity (e.g. Status = "E" for "Expired", "A" for "Approved", "NADIS" for "No Asbestos Detected In Sample")
  • Linking tables - combination of FKs (e.g. EventId, AttendeeId)

So ideally you end up with a natural, human-readable and memorable PK, and an ORM-friendly one-ID-per-table GUID.

Caveat: the databases I maintain tend to the 100,000s of records rather than millions or billions, so if you have experience of larger systems which contraindicates my advice, feel free to ignore me!

Keith Williams
A: 

I really like the info I get from The Database Programmer blog. The article about primary keys is very good. Maybe you'll like it, too.

duffymo