In a fairly animated discussion in my team I was made to think what most people like as primary keys. We had the following groups-

  1. Int/ BigInt which autoincrement are good enough primary keys.
  2. There should be at least 3 columns that make up the primary key.
  3. Id, GUID and human readable row identifiers all should be treated differently.

Whats the best approach for PKs? It would be awesome if you could justify your opinion. Is there a better approach that the above?

EDIT: Anyone has a simple sample/algorithm to generate human readable identifiers for rows that scales well?


This is a classic "it depends". There's no one right answer for every project. I like different things for different situations. It depends on whether I'm using an ORM and what it supports. It depends on the overall architecture (distributed or not, etc). Just pick one that you think will work and move on to arguing over tabs and spaces.

John Sheehan
He still wish's to know HOW it depends; only with awareness of these can one come to trust one's self to choose...
Nicholas Leonard
+3  A: 

I'm a fan of the auto-increment as primary key. I know deep in my heart that this is a cop-out, but it does make it so easy to sort data by when it was added (ORDER BY ID DESC, f'r instance).

3 columns sounds awfully harsh to humanly parse.

And that's the trade-off -- how much of the relational capability do you need, versus making THIS TABLE RIGHT HERE understandable to a human interrogating it (versus the stored-procedure or programmatic interface).

auto-increment is for us humans. :-(

Michael Paulukonis
+4  A: 

Generally, it depends.

Personally, I like autoincrement ints.

But, one thing I can tell you is to never trust data from other sources as your key. I swear, every time I've done that it comes back to bite me. Well, never again!

+34  A: 

If you're going to be doing any syncing between databases with occasionally connected apps, then you should be using GUIDs for your primary keys. It is kind of a pain for debugging, so apart from that case I tend to stick to ints that autoincrement.

Autoincrement ints should be your default, and not using them should be justified.

Bramha Ghosh
This really is the right answer, GUID = Global Unique IDentifier. This is what you want from a PK isn't it? Globally identify a piece of data?
A GUID isn't necessary, just change the step to 10 or 20 or however many server's you'll need to possibly sync with in the future.
Robert C. Barth
GUIDs hose your index fill factor if you use them as your clustering index. Should use a GUID as an AK really.
90% of the time at least, a GUID is not needed and wastes space.
Jonathan Leffler
And, unless you're really careful, can hose index fragmentation as well, since the most common algorithms used to create them, puts the "random" portion of the guid in the most significant bit positions... This increases the requirement for index defragmentation / Reindexing
Charles Bretana
I seriously feel GUIDs is an overkill. Never ever had a need to have GUIDs as my primary keys yet.
Cyril Gupta
@bramhaghosh yeah, guids suck... thanks.
John Boker
@stephbu: the SQL Server function newsequentialid() solves the index fragmentation problem, though 24 bytes is still a tad excessive if you don't absolutely need global uniqueness. See
oops, I meant to say 16 bytes
Or, instead of wasting space and risking collision with a GUID, make a composite key of the original primary key and a small identifier, where the small identifier is different for each sync source.
+1  A: 

I tend to use option #1 or #3 depending on the size, the number of people connecting, and whether it is a multiple database server situation or not.

Option #2 doesn't make much sense to me. If any one of the three is not enough to identify a unique record, then it's possible (without going through extra machinations) two have two records show up with the same values in all three columns. If you want to enforce uniqueness on any combination of the three, then just add an index for them.

+14  A: 

I like The Database Programmer blog as a source for this kind of info.

3 columns for a primary key? I would say that columns should have appropriate unique constraints as the business rules demand, but I'd still have a separate surrogate key. Compound keys mean business logic enters into the key. If the logic changes, your whole schema is screwed.

They changed their link, here is the updated bookmark:
Redbeard 0x0A
+3  A: 

There should be atleast 3 columns that make up the primary key.

I don't understand this.

Are you talking about a "natural key", e.g. "name and date of birth"? A natural key might be ideal if it exists, but most candidates for a natural key are either not unique (several people with the same name) or not constant (someone can change their name).

Int/ BigInt which autoincrement are good enough primary keys.

I prefer Guid. A potential problem with autoincrement is that the value (e.g. "order id") is assigned by the database instance (e.g. by the "sales database") ... which won't entirely work (instead you start to need compound keys) if you ever need to merge data created by more than one database instance (e.g. from several sales offices each with their own database).

Primary keys are required to be unique, but are not required to be constant. Hence foreign keys declared with "ON UPDATE CASCADE". But making an assumption that primary keys are constant helps to simplify many applications. This is one benefit of surrogate keys.
Bill Karwin


In the event that you need to scale out or you need to assign the primary key by alternate means they will be your friend. You can add indexes for everything else.

Chris Lively
You ever examined the fill factor of your indexes? Random nature of GUID's make 'em swiss cheese - dramatically reducing their effectiveness.
"Guids.period": That is so wrong. GUIDs should be used where appropriate. As the other commenter pointed out, it might make life as a programmer easy, but affects the overall size and performance of the DB.
Mitch Wheat
No: GUID as a last resort.
Jonathan Leffler
At the end of the day, I can scale my apps out across multiple database servers without issue. But I guess you guys work on small sites.
Chris Lively
GUID *might* be ok for the logical primary key, but NEVER EVER EVER use a GUID column as your CLUSTERING key - you'll be drowning in index fragmentation leading to POOR performance .....
I certainly wouldn't proclaim "Guids.period." on this topic -- in fact even in an industry so chock full of 'best practices' that sort of statement puts you on shaky ground by default (particularly with that statement). Anything as painful to deal with as a GUID needs some hard justification and as JL says, I think most of us would consider it a last resort. It's as if you posted without reading the rest of the thread.

This is a complex subject whether you realized it or not. Might fall under the section on this StackOverflow FAQ.

What kind of questions should I not ask here?

Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!

This has been debated for years and will continue to be debated for years. The only hints of consensus I have seen is that the answers are somewhat predictable depending on if you are asking a OO guy (GUIDs are the only way to go!), a data modeler (Natural keys are the only way to go!), or a performance oriented DBA (INTs are the only way to go!).

I will not let the discussion go to long. I was just curious to see the general consensus.
I say ask whatever questions you wish! Else, this community will become static and overcontrolled like wikipedia seems to have become. Seems to me like some times you need to let people ask whatever do choose to ask. Trust them, and they might come to trust themselves!
Nicholas Leonard
+1  A: 

I've only use an auto-increment int or a GUID. 99% of the time I've use auto-increment int. It's just what I was taught to use when I first learned about databases and have never run into a reason not to use them (although I know of reasons why a GUID would be better).

I like auto increment ints because it helps with readability. For example I can say "take a look at record 129383" and it's pretty easy for someone to go in and find it. With a GUID that's nearly impossible to do.

Then you were taught badly.
Jonathan Leffler
Why do you say that? It seems many people use an auto-increment integer. It can't be that bad if it works and works well for what you need.
+1  A: 

Past a basic definitional answer, what constitutes a good primary key is left largely to religion and break room arguments. If you have something that is, and will always, map uniquely to an individual row, then it will work fine as a primary key. Past that point, there are other considerations:

  • Is the primary key definition not overly complex? Does it avoid introducing unnecessary complexity for the sake of following a "best-practice"?
  • Is there a better possible primary key that would require less overhead for the database to handle (i.e. INTEGER vs. VARCHAR, etc)?
  • Am I ABSOLUTELY certain that the uniqueness and defined-ness invariant of my primary key will not change?

This last one is likely what draws most people to use things like GUIDs or self-incrementing integer columns, because relying on things like addresses, phone numbers, first/last names, etc, just don't cut it. The only invariant about people I can think of is SSNs, but then I'm not even 100% certain about those remaining forever unique.

Hopefully this helps add some clarity...

Ed Carrel
There are some historical cases where SSN's are not unique.
Bill Karwin
+1  A: 

The way I approach primary keys (and I feel is the best) is to avoid having a "default" approach. This means instead of just slapping on an auto-incrementing integer and calling it a day I look at the problem and say "is there a column or group of columns that will always be unqiue and won't change?" If the answer is yes then I take that approach.

Andrew G. Johnson
Does that mean you 'avoid auto-incrementing integers whenever you can'? My understanding was that industry experts thought the best performance on large-scale databases comes from minimal-signature, indexed, incremental single-column PKs.
I always thought experts used the best tool for the job
Andrew G. Johnson
+1  A: 

Almost always integers.

They have other good reasons besides being smaller/faster to process. Which would you rather write down - "404040" or "3463b5a2-a02b-4fd4-aa0f-1d3c0450026c"?

Ant P.
The latter may be an integer, with dashes added and in base 16. But yes, 404040 is faster to process than the long GUID. Then again, 0 is even faster to process because it doesn't require a single bit of data!
+10  A: 

This is only a religious issue because people seek a universal right answer. The fact that both your team and this SO thread shows so much disagreement should be a clue that there are good reasons to use all the solutions you describe, in different circumstances.

  • Surrogate keys are useful when no other attribute or set of attributes in the table is suitable to identify rows uniquely.
  • Natural keys are preferred, when possible, to make the table more human-readable. Natural keys also allow the foreign key in a dependent table to contain a real value instead of a surrogate id. E.g. when you need to store state (CA, TX, NY) you might as well use a char(2) natural key instead of an int.
  • Use compound primary keys where appropriate. Do not add an "id" surrogate key unnecessarily when a perfectly good compound key exists (this is especially true in many-to-many tables). A mandate for a three-column key in every table is absolute nonsense.
  • GUIDs are a solution when you need to preserve uniqueness over multiple sites. They are also handy if you need values in the primary key to be unique, but not ordered or consecutive.
  • INT vs. BIGINT: it's not common that a table requires a 64-bit range for primary keys, but with the increasing availability of 64-bit hardware it shouldn't be a burden, and gives more assurance that you won't overflow. INT is of course smaller, so if space is at a premium it can give a slight advantage.
Bill Karwin
I disagree as much as a person can possibly do so. Natural keys are horrible. What if one wants to change the data? Oh, you can't. Writing joins on composite natural keys is a pain. Carrying that composite key to all your related tables is a waste.
Robert C. Barth
@Robert: read about "ON UPDATE CASCADE". But I get what you are saying, and I agree it's best to use a surrogate key most of the time, because attributes are subject to change and to be non-unique.
Bill Karwin
Primary keys should be immutable. Cascade updates are only a ugly hack for a bad design decision in this case. Natural keys are NEVER preferred. Same to composite keys, that spread themselves like a plague. Anyone with more than 3 months of database development experience would know this.
@F.D.: I don't agree with your unequivocal statement, and I've been developing with SQL databases since 1992. But certainly it's true that surrogate keys are best able to remain immutable.
Bill Karwin
+5  A: 

I think the use of the word "Primary", in the phrase "Primary" Key is in a real sense, misleading.

First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table,

Then, having any key serves several often mutually inconsistent purposes.

  1. To use as joins conditions to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables)
  2. (related) Ensuring that child records must have a parent record in the parent tab;e (The child table FK must exist as Key in the parent table)
  3. To increase perforamce of queries that need to rapidly locate a specific record/row in the table.

  4. To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted itno the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)

Clearly, any non-meaningfull, non-natural key (like a GUID or an auto-generated integer is totally incapable of satisfying #4.

But often, with many (most) tables, a totally natural key which can provide #4 will often consist of multiple attributes and be excessively wide, or so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequencecs.

The answer is simple. Use both. Use a simple auto-Generating integral key for all Joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) have an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is a PK, to avoid inconsistent duplciate data...

As to GUIDs, be very careful using them, as using guids in an index can hose index fragmentation. The most common algorithms used to create them puts the "random" portion of the guid in the most significant bit positions... This increases the requirement for regular index defragmentation / Reindexing as new rows are added.

Charles Bretana
The SQL Server function newsequentialid() solves the index fragmentation problem of GUIDs (though 24 bytes is still a tad excessive if you don't absolutely need global uniqueness). See
oops, I meant to say 16 bytes.
+5  A: 

I always go with the surrogate key. A surrogate key (usually an identity column, autoincrement, or GUID) is one in which the key is not present in the data itself. A natural key, on the other hand, is one that, on its own, uniquely identifies the row. As near as I can tell in life, there are hardly any real natural keys. Not even things like SSN in the United States is a natural key. Composite primary keys are a disaster waiting to happen. You can't edit any of that data (which is the major drawback of any natural key, composite or not), but worse is that with a composite key, now you have to perpetuate that key data into every related table. What a giant waste.

Now, for selection of the surrogate key, I stick with identity columns (I work mostly in MS SQL Server). GUID's are too large and Microsoft recommends against using them as a PK. If you have multiple servers, all you need to do is make the increment 10 or 20 or whatever you think the maximum number of servers you'll ever need to sync/expand to, and just inc the seed for each table on each subsequent server, and you'll never have a data collision.

Of course, because of the increment, I make the identity column a BigInt (otherwise known as a long [64 bits]).

Doing a bit of math, even if you make the increment 100, you can still have 92,233,720,368,547,758 (> 92 quadrillion) rows in your table.

Robert C. Barth
+5  A: 

One thing you should never do is use a smart key. That is a key where information about the record is coded in the key itself, and it will eventually bite you.

I worked one place, where the primary key was the account ID, which was a combination of letters and numbers. I don't remember any specifics, but, for example, those accounts that were of a certain type, would be in the 600 range, and of another type, started with 400. That was great, until that customer decided to ask for both types of work. Or changed the type of work they did.

Another place, used the location in the tree as the primary key for records. So there would be records like the following.


Of course, the first thing the customers wanted was a way to move items in the tree around. The entire set of software died before that happened.

Please, please, please, if you're writing code that I ever have to maintain, please don't use a smart key!

I agree wholeheartedly. Smartkeys = dumb.
Robert C. Barth
+2  A: 

Auto increment columns. I am able to make my code work seamlessly with SQL Server or Oracle, one using identity the other using sequences through my DAL, and I couldn't be happier. I agree, GUIDs sometimes are necessary if you are doing replication or sending data away to receive it later on afer processing.

Otávio Décio
+1  A: 

Only slightly relevant, but one thing I've started doing recently when I have small classification tables (essentially those that would represent ENUMs in code) is that I'll make the primary key a char(3) or char(4). Then I make those primary keys representative of the lookup value.

For example, I have a quoting system for our internal Sales Agents. We have "Cost Categories" that every quote line item is assigned one of... So I have a type lookup table called 'tCostCategories', where primary key is 'MTL', 'SVC', 'TRV', 'TAX', 'ODC'. Other columns in the lookup table store more details, such as the normal english meanings of the codes, "Material", "Service", "Travel", "Taxes", "Other Direct Costs", and so forth.

This is really nice because it doesn't use any more space than an int, and when you are looking at the source data, you don't have to link the lookup table to know what the heck the value is. For example, a quote row might look like:

1 PartNumber $40 MTL
2 OtherPartNumber $29.99 SVC
3 PartNumber2 $150 TRV

It's much easier that using an int to represent the categories and then linking 1, 2, 3 on all the lines - you have the data right there in front of you, and the performance doesn't seem affected at all (not that I've truly tested.)

As far as the real question goes... I like RowGUID uniqueidentifiers. I'm not 100% on this, but don't all rows have internal RowGuid's anyway?? If so, then using the RowGuid would actually take less space than ints (or anything else for that matter.) All I know is that if it's good enough for M$ to use in GreatPlains then it's good enough for me. (Should I duck??)

Michael Bray
+1  A: 

Oh one more reason I use GUIDs - I use a hierarchical data structure. That is, I have a table 'Company' and a table 'Vendor' for which the Primary Keys match up. But I also have a table 'Manufacturer' that also 'inherits' from Company. The fields that are common to Vendors and Manufacturers don't appear in those tables - they appear in Company. In this setup, using int's is much more painful than Guids. In the very least, you can't use identity primary keys.

Michael Bray
Yes you can, you just don't make the subtype tables have the identity property, instead they get explicit inserts of the supertype table value. Please see
+2  A: 


Watch out if this is going to be a really Really REALLY REALLY big database, lots of load, and fast access.

At my last job, where we had databases of 100 to 500 million records, our database guys strongly argued against GUIDs, and for an appropriately sized decimal number. They felt that (under Oracle) the size difference in the internal storage for a string Guid - vs- a decimal value would make a very noticeable difference in lookups. ( Bigger keys = deeper trees to traverse)

The random nature of GUIDs also reduces the fill-factor for index pages significantly - this dramatically increases tearing and disk I/O.

"Reduces the fill-factor"? Not sure what that could meanFill-factor is a one shot deal, defined as the percent of free space requested at the leaf-level of the index at the time the index is built.GUID values by their random nature distribution across the breadth of the leaf-level on inserts into that free-space that fill-factor provided.
Ralph Shillington
Since when is a GUID a string? GUIDs should be stored internally as 16 bytes by any respectable DBMS. Storing as 32 bytes in the hex representation would be unconscionable! (or 36 with dashes, or 38 with curly braces)
+20  A: 

I don't see an answer which points out (what I regard as) the really fundamental point - namely, that a primary key is what guarantees that you won't get two entries in the table for the same real-world entity (as modelled in the database). This observation helps establish what are good and what are bad choices for primary key.

For example, in a table of (US) state names and codes, either the name or the code could be the primary key - they constitute two different candidate keys, and one of them (normally the shorter - the code) is chosen as the primary key. In the theory of functional dependencies (and join dependencies - 1NF through 5NF - it is the candidate keys that are crucial rather than a primary key.

For a counter-example, human names generally make a bad choice for primary key. There are many people who go by the name "John Smith" or some other similar names; even taking middle names into account (remember: not everyone has one - for example, I don't), there is plenty of scope for duplication. Consequently, people do not use names as primary keys. They invent artificial keys such as the Social Security Number (SSN) or Employee Number and use them to designate the individual.

An ideal primary key is short, unique, memorable, and natural. Of these characteristics, uniqueness is mandatory; the rest have to flex given the constraints of real world data.

When it comes to determining the primary key of a given table, therefore, you have to look at what that table represents. What set or sets of column values in the table uniquely identifies each row in the table? Those are the candidate keys. Now, if each candidate key consists of 4 or 5 columns, then you might decide that those are too clumsy to make a good primary key (primarily on grounds of shortness). In those circumstances, you might introduce a surrogate key - an artificially generated number. Very often (but not always) a simple 32-bit integer is sufficient for the surrogate key. You then designate this surrogate key as the primary key.

However, you must still ensure that the other candidate keys (for the surrogate key is a candidate key too, as well as the chosen primary key) are all maintained as unique identifier - normally by placing a unique constraint on those sets of columns.

Sometimes, people find it difficult to identify what makes a row unique, but there should be something to do that, because simply repeating a piece of information doesn't make it any more true. And if you're not careful and do get two (or more) rows purporting to store the same information, and you then need to update the information, there is a danger (especially if you use cursors) that you will update just one row rather than every row, so the rows are out of synchrony and no-one knows which row contains the correct information.

This is a pretty hard-line view, in some respects.

I've no particular problem with using a GUID when they are needed, but they tend to be big (as in 16-64 bytes), and they are used too often. Very often a perfectly good 4-byte value would suffice. Using a GUID where a 4-byte value would suffice wastes disk space, and slows up even indexed access to the data since there are fewer values per index page, so the index will be deeper and more pages have to be read to get to the information.

Jonathan Leffler
Regarding your sample with US state names I would prefer a separate surrogate key, simply because the codes are something beyond your control. If they should change for whatever reason you get a problem.
(continued) For example, Germany replaced a 4-digit zip code system with a 5-digit system back in the 1990ies after the re-unification.
@divo: I am a strong advocate of artificial/surrogate keys, but even I don't see the 4-digit to 5-digit postal code change as being a good example. Postal codes are not generally used as keys to anything. (When's the last time you had to query a PostalCode table to find out something about that code? No, it's almost exclusively used as part of an address without being referenced in any other tables. I would say your suggestion is almost on par with using surrogate keys for addresses themselves.)
@Emtucifor: Yes, maybe ZIP is not a very practical example, but my point was that if part of your surrogate key is out of your control and changes for whatever reason, you are in trouble. Think of someone creating a new social security number scheme, a new ISSN scheme or - maybe more realistic - a company deciding to create a new product id system after a merge, assigning new employee numbers to their employees to adjust their growth etc. These are all just fictional examples, but, as my previous example with the ZIP shows, sometimes a well-established system might change.
@divo: Agreed that Social Security Number is a good example.

I've always used a surrogate key - an autoincrementing integer called 'id'. I can see plenty of reasons to do this even when another option is obvious:

  • Consistency
  • Data independent (unique, not destroyed by changes to format)
  • Human-readable

...and no sensible reason not to:

  • Ambiguity in joins? - Aliasing tables is a better practice, IMHO
  • Optimum tables? - Removing one byte per entry is premature optimisation, IMHO
  • Per-table decision? - No longer consistent
  • Scaling problems? - Eh? Why?
  • Hierarchical data structure? - That's denormalising, a whole other subject of religion. Suffice it to say I'm a fan in a few circumstances in theory, but never in practice :)

sensible reasons against that I haven't thought of or come across yet are always welcomed...

+4  A: 

Slightly off-topic, but I feel compelled to chime in with...

If your primary key is a GUID, do not make it a clustered index. Since GUIDs are non-sequential, the data will be re-arranged on disk during almost every insert. (Yuck.) If using GUIDs as primary keys, they should be nonclustered indexes.

Very good point - one needs to distinguish between the *LOGICAL* concept of a primary key (might be valid to use a GUID for that, especially if replication is involved), and the *PHYSICAL* concept of the clustering key - that should *NEVER* be a GUID since it leads to excessive index fragmentation
This is in fact not accurate. The data will be inserted in order, which given the GUID's random nature could end up being anywhere across the table. On the off chance that there isn't room, a page split will happen, but certainly not "re-arranging on disk during every insert" not even close.
Ralph Shillington
@Ralph, you're right, not EVERY insert, but enough to cause a 20x performance hit.
The SQL Server function newsequentialid() solves the index fragmentation problem with GUIDs (though 24 bytes is still a tad excessive if you don't absolutely need global uniqueness). See
Sorry... meant 16 bytes
+1  A: 

I like natural keys, whenever I can trust them. I'm willing to pay a small performance price price in order to use keys that make sense to the subject matter experts.

For tables that describe entities, there should be a simple natural key that identifies individual instances the same way the subject matter people do. If the subject matter does not have trustworthy identifiers for one of the entities, then I'll resort to a surrogate key.

For tables that describe relationships, I use a compound key, where each component references an entity that participates in the relationship, and therefore a row in an entity table. Again, the performance hit for using a compound key is generally minimal.

As others have pointed out, the term "primary key" is a little misleading. In the Relational Data Model, the term that's used is "candidate keys". There could be several candidate keys for a single table. Logically, each one is just as good as another. Choosing one of them as "primary" and making all references via that key is simply a choice the designer can make.

Walter Mitty
Please describe some sample trustworthy natural keys?
"trustworthy" is not a property of a key by itself. Rather, it has to do with the key in the context of the people who supply the data. If you are writing an app to be sold to somebody who will actually be managing the data, you have to guess which keys will be trustworthy to the client or not. Given the variety of clients, you're almost surely going to guess wrong for some fraction of your clientele.
Walter Mitty
Having said the above, here's an example of a key that we trusted way back when. We had a database about courses. It included textbooks and other course materials about courses, scheduled course offerings, instructors who were qualified to teach courses, course prerequisites, tuition, and so on.When course development created a new course, one of the first things they did was to assign a course code. They were responsible for making sure that course codes were unique, and that courses never changed their code, once assigned. It was part of the data as given to us.
Walter Mitty
Another good example of trusted natural key is VIN (Vehicle Identification Number). For the last many years, every vehicle sold as new has a VIN attached to it. They can be trusted to be unique and unchanging.
Walter Mitty
+6  A: 

I like mine unique.

Andrew from NZSG