views:

824

answers:

12

Choosing good primary keys, candidate keys and the foreign keys that use them is a vitally important database design task -- as much art as science. The design task has very specific design criteria.

What are the criteria?

A: 

Only one really, choose a surrogate for each table (identity/auto_number) or something similar that the users will never even see so you can do whatever is necessary with them whenever you need to now and in the future.

Tahbaza
So you are saying there are no criteria at all. Is that correct?
bbadour
It depends on whose criteria you refer to. Primary keys should be internal to a database and determined WITHOUT end user input in my opinion for performance and maintainability reasons.
Tahbaza
@Tahbaza I thought I was clear when I used words like "choose", "good", "database design", "design task" and "design criteria". Was I not?
bbadour
@Tahbaza Your reply does not answer the question. The use of a surrogate key does not obviate the need to choose other keys.
dportas
+3  A: 

A Primary Key is a key that uniquely identifies an entity. When you are choosing a primary key, the best choice is almost always a surrogate key that has absolutely nothing to do with the entity at all other than uniquely identifying it.

And that's it. There are supposedly rare edge cases where a primary key might be a natural key, but I've never seen a valid one.

Most of us use a 32-bit auto-increment integer as a primary key. Another excellent choice (in certain circumstances) is a UUID.

Randolpho
You seem to agree that no design criteria exist. Is that correct?
bbadour
@rockinthesixstring So you think uniqueness is the only criterion. Is that correct?
bbadour
If your answer is "The only criterion for choosing a good primary key is uniqueness", why didn't you say that in your answer instead of pasting in a definition from somewhere else?
bbadour
+14  A: 

What is a Primary Key?

The primary key is something that uniquely identifies a row/record of data. It can also be multiple columns, which is called a composite.

Ability to Change

Because the primary key is often used for foreign references, it should be as stable as possible. All data in the database is mutable, providing someone is connecting with an account that has appropriate privileges. This is why databases provide the ability to define CASCADE ON DELETE and CASCADE ON UPDATE--to sync referential dependencies without having to disable constraints.

Natural or Artifical/Surrogate?

Ideally, you want a natural key. A natural key is existing data that uniquely identifies the entity you are modeling. For example, the abbreviations of US states is a good natural key because the abbreviation is consistent and everyone knows them:

US_STATE_PRIMARY_KEY  US_STATE
-------------------------- 
AL                    Alabama
AK                    Alaska
AZ                    Arizona
AR                    Arkansas
CA                    California

Don't try too hard to find a natural key. They seldom exist. It's unlikely that a US State name would change, but it is plausible.

Realistically, primary keys will typically be artificial (often generated by database functionality). These are typically numbers or GUIDs, and they're considered artificial because on their own - there's nothing to relate their value to the information they uniquely identify. A sales receipt is always numbered, because there's nothing natural about it and it's also for auditing - gaps in the receipt numbers raise suspicions. To demonstrate how arbitrary numbering is, here's the US state table but using an integer for the primary key column, US_STATE_CODE:

US_STATE_PRIMARY_KEY  US_STATE
-------------------------- 
100                   Alabama
101                   Alaska
102                   Arizona
103                   Arkansas
104                   California

There's no requirement to start the value at one; some shops use this as a security measure to thwart SQL injection. The value is sequential based on the alphabetic ordering of the State name, but that can't be guaranteed. But unlike the natural key, if the state name changed - only one column would have to be updated.

Single Column vs Composite

Ideally one column will be the primary key, but make the decision based on the data at hand--do not combine columns just for the sake of having a single column. If you do shoehorn data together, use a character to separate the data easily (though operations to do this won't be able to take advantage of an index if present).

Performance

From a performance perspective, integers are best because they offer a decent range of values and the number of bytes used is small when you compare to VARCHAR of five or more characters.

OMG Ponies
If you have a primary key in a relation with other data, doesn't the relation relate the primary key to the data?
bbadour
@bbadour: That's a foreign key, if I understand you correctly. A foreign key can be a primary key, and generally is in many-to-many tables.
OMG Ponies
No, I was quite clear when I said "primary key". Had I meant "foreign key" I would have mentioned something about references.
bbadour
OMG Ponies
If that is supposed to address the question, I can't make any sense of what you wrote at all.
bbadour
@bbadour: Can you be more specific about what you don't understand?
OMG Ponies
I don't understand how you think you communicated anything sensible about how primary keys in relations relate to data.
bbadour
@bbadour: That's the first lesson with primary keys - do you have a naturally occurring value that uniquely identifies a record? If the answer is no, you need to use an artificial value (integer, GUID).
OMG Ponies
Now you seem to be saying that uniqueness and identification are design criteria. Is that correct?
bbadour
@bbadour: I don't think "identification" is the best choice to describe natural/artificial keys, but yes - both are design criteria.
OMG Ponies
If a key doesn't identify, what does it do?
bbadour
@bbadour: "identification" doesn't explain natural/artificial values for a key. Both are ultimately used to identify a record in a table, but as I've said numerous times now - an artificial key value, isolated from any of the data in the table, has no relational/identification significance. A value of "1" tells you nothing about the record it is attached to.
OMG Ponies
Sure it does if it is attached to it.
bbadour
To clarify, it says: `This is the one with 1 attached to it.`
bbadour
If you're going to use natural keys, be very careful that you pick one that is static. Phone numbers may be unique, but they can also be changed. Similarly, a company may change their employee ID format. Even SSNs can be changed or duplicated (in rare circumstances). State abbreviations could be a reasonable primary key as long as you're using it only as a primary key, not for displaying, since capitalization conventions can change,.
Lèse majesté
@Lèse majesté: Familiarity and stability often conflict. Any of the criteria can conflict. The art of design involves finding an effective trade-off among conflicting criteria. Different designers will make different trade-offs. Some better than others.
bbadour
OMG Ponies
Consider 2 equivalent tuples: {3296013,840082470,Bob Badour,PE,CA} or {840082470,Bob Badour,745} Which is more useful and informative?
bbadour
@bbadour: Modeling a person is a notorious poor example--first, middle and surnames can match but actually reflect different people. SSN/SIN are not reliable, because they can change (immigration, legally dead). A name doesn't guarantee gender, and gender can be changed. The situation requires using an artificial key for these numerous facts--there is no natural key with regard to people. DNA hasn't been able to distinguish between twins...
OMG Ponies
@OMG Ponies: None of your comments has any relevance to my example. My example does not use name as a candidate key for either tuple. Yet, even without attribute names or domain names, you recognized the domain, which speaks to the value of familiarity. In any case, I see I made a mistake in my example, I had intended {3296013,840082470,Bob Badour,745) versus {840082470,Bob Badour,PE,CA}
bbadour
@bbadour: That's because your example has nothing to do with the question - "usefulness" is meaningless to primary key criteria. "Informative" I reach to construe as being a natural key, which is a common modeling convention I've yet to see you accept. Your statement reinforces the use of an artificial key. I don't understand what you fail to grasp, and why you are so combative.
OMG Ponies
I don't know why you assume I am combative or why you assume I fail to grasp something. Perhaps you project your own flaws onto me? In my corrected example, the 1st tuple follows the rule of always using a surrogate key, and the 2nd tuple makes effective design tradeoffs. Can you tell what 745 means by looking at it? What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada. A good design is correct, elegant and useful. I contend the 1st example tuple fails on 2 counts.
bbadour
@bbadour: If you grasp the issue, you'd have marked the answer. When I asked for a specific thing you had issue with, you evaded the question by attacking my answer. When I re-iterated my original example, you ignored it while supplying data of your own design asking "which is more useful and informative?". Now you require design to be "elegant" and "useful"... What that has to do with primary key criteria, I don't know. All I get from your latest response is that you comprehend how arbitrary a numeric value as a primary key is.
OMG Ponies
If you had given a clear, sensible, correct and useful answer, I would have marked the answer. You have not. If I cannot extract any sensible meaning from what you write, I cannot sensibly respond. I suggest that those, who think they extracted sensible meaning from what you wrote, merely projected their own assumptions onto it to fill in the gaps. You call that attacking your answer, and I call that honestly admitting my own limitations.
bbadour
OMG Ponies
I can only suggest you look for the flaws in your answer instead of flaws in me. Your resources would be better allocated that way by any number of measures.
bbadour
@bbadour: I'll address flaws when you actually describe them. You are the target audience
OMG Ponies
I already described the flaws in what you wrote. ::shrug::
bbadour
In my experience, natural keys are almost always the source of all drama in database design. About the only use they have is avoiding joins. The rest of the time you're wasting fixing issue such as when your natural key isn't as unique as you thought, or dealing with cascading updates that become burdensome.
Will
@bbadour: sorry, but I'm with OMG Ponies here - from you comments I can't figure out what you are objecting to. His answer could be unclear, but can you concede your wording could be too as well? Since I consider this answer as pretty good and complete, I'm really interested in what I could miss. So could you try to put your arguments in other words, please (one flaw at a time)?
Alexander Malakhov
@bbadour, regarding your points: **1.** "Can you tell what 745 means by looking at it" - no, you just do *JOIN* and show what user needs, no problems here. **2.** "What does 3296013 achieve that was not already achieved by 840082470" - if you're sure 840082470 will not change till the end of the world, use it as PK. He clearly stated this in USA states example, right ? On the other hand, why do you have to guess whether social number or whatever will change in future or not ? IMO memory required for one integer column absolutely worth 100% guarantee, that your references will not be broken
Alexander Malakhov
@Alexander, you assume 1) the person knows how to join, 2) the person has authority to query the referenced table, 3) the person knows how to find the referenced table, 4) the data is still in the dbms where one can perform a join, 5) no cost accrues from discovering a join is required. If any of your assumptions is false, everything you predicate on them is pointless.
bbadour
OMG Ponies
@Will - I disagree. Data migrations and consolidations are significantly easier with natural keys. Keep in mind that the use of a surrogate key does not obviate the need for another unique key on the table. Users must have means of differentiating one row from another by something other than the surrogate key. Think about the headache of consolidating two tables of States where the surrogate key for each State between the two systems are different. E.g. CA = 1 in system A and 99 in system B.
Thomas
@bbadour - Regarding your last post, am probably not the only one that has no idea where you are going with your comment. If you use a surrogate key, the system designer and the application designer must know to use them and the data must exist (along with the database, computer systems, planet earth...). So?
Thomas
@OMG Ponies - I'm not sure I've ever heard the term "artificial" key. Rather, I read that these were called surrogate keys. However, I'll grant you that "artificial" seems more diametric than "surrogate".
Thomas
@bbadour: **1)**, **3)** Well, if your developers can't figure it out or you can't explain it to them in 10 minutes, you are in biiig trouble :). **2)** It makes no sense to give access to main table and not to give to lookup table. **4)** How can it be elsewhere ? If you meant data corruption, this doesn't relate to design or even implementation. **5)** Do you mean performance issues ? If so, then yes - sometimes you need to de-normalize tables, but no one said you have to be zealous. And that is rare case anyway. So, have I convinced you ?
Alexander Malakhov
@Thomas: Very good point - "surrogate" is the more common terminology, but "artificial" from an explanation standpoint makes *way* more sense. I'm guilty of fighting terminology, like how people call derived tables/inline views *and* subselects "subqueries". My redemption lies with things making sense rather than buzzwords ...I think/hope =)
OMG Ponies
@Alexander and @Thomas: You seem to have some perverse notion that good database designers design only for a select group of expert users who never pull data out into a spreadsheet. What an absurd notion!
bbadour
@bbadour - There is no inherent means in a spreadsheet to enforce any of the elements of database theory. Why talk about the attributes of a good primary choice for people using a tool that cannot enforce the fundamental aspects of a primary key? Is it really that crazy that we'd assume that someone asking about elements of database design would be using a database management system of some sort?
Thomas
@bbadour: Sorry, I don't see what is your point. Do you mean "asking from developers to know what is *JOIN* " is too much, or that "users will have to deal with raw DB and they will be confused by surrogate keys" or what Thomas said ?
Alexander Malakhov
@bbadour: The database design is for optimally storing the data, not making it available to users--that's what UI is for. Expert spreadsheet users don't do hierarchical/recursive queries, and long for the speed of aggregate functions because they have to do that work *by hand*.
OMG Ponies
@Thomas I completely agree that, per your example, that a States table should use either the state abbreviation or the state name as its key. States don't ever change their name and are guaranteed to be unique. You picked one of the few instances where it is the perfect solution. One of the few, IMHO.
Will
The five assumptions I enumerated are assumptions. During normal use of a database management system by various users, one or more of those assumptions will often be false. If someone cannot conceive of an assumption as an assumption or if someone cannot imagine a situation where an assumption could be false, I can offer no remedy.
bbadour
@OMG Ponies: You seem not to understand that the purpose of a database management system is to manage data for all users of the system, or that managing includes making data available to users of the system.
bbadour
+7  A: 

Database design starts with a conceptual data model (such as an entity relationship diagram) and finishes up with a database schema or schemas. Entities are mapped to tables; in this process one entity may be split into several table, several entities may be merged into a single table and new tables may arise (for instance, intersection tables to implement many-to-many relationships).

In an ERD entities have primary keys. These are natural keys, that is they are attributes of the entity. For a PERSON entity it might be SocialSecurityNumber. For an ORDER entity if might be OrderRef For an INVOICE entity it might be InvoiceNo. In the first case that is a real-life identifier; in the second case it is a smart key in an ugly format (2010/DEF/000023 ); in the third case it is a monotonically incrementing number because that is what the current paper-based system uses.

Natural keys can be fanciful. I once worked on a database design where the analyst had specified the CUSTOMER entity with a key of (FullName, Address, Sex, DateOfBirth, DistinguishingCharacteristics) on the basis that two individuals of the same name, birth date and gender could live at the same address.

The characteristics of an entity's primary key are:

  • unique
  • familiar
  • stable (presumed)
  • minimal (one or more attributes but as few as necessary)

When it comes to primary keys for database tables, natural keys are not always suitable.

There are many reasons not to use SSN as a physical primary key. Protection of a citizen's personal data is actually the most important but it is also the case that an individual's number can change. Primary keys should be unvarying.

Smart keys are dumb. They are actually compound keys compressed into a single column. They are better represented as separate columns, not least because it is a frequent requirement to search on single elements of the key. Also, the format of such keys can change.

In general compound keys are a pain as primary keys because we have to cascade multiple columns as foreign keys. This is exacerbated when the child's primary key is defined as a serial number within the parent's primary key. There are systems out there which dependent tables inheriting a nine-column foreign key from a parent when they have a scant two data columns of their own. Sometimes this sort of inheritance can be useful but mostly it is a just a hassle.

The characteristics of an entity's primary key are:

  • unique
  • appropriate (meaningless)
  • guaranteed stability
  • minimal, usually a single column (except for intersection tables)

So unless the candidate key is a meaningless identifier (such as InvoiceNo) a table should have a synthetic key (AKA surrogate key). This can be a monotonically incrementing number or a GUID according to your needs. Regarding intersection tables, if they have no other attributes or dependent tables there is no value in replacing a compound primary key (AKA composite key) with a synthetic one.

The crucial thing is: we still enforce the candidate keys. This means applying UNIQUE constraints on those columns - SSN , OrderRef - in the parent table. This is because a synthetic key uniquely identifies a row in a table, it does not uniquely identify the data.


Regarding familiarity

Familiarity is a curly one. It is an important consideration when it comes to we are identifying primary keys in a conceptual data model but it is less useful when it comes to database design.

In a commnet @bbadour provides two contrasting examples:

{3296013,840082470,Bob Badour,745} versus {840082470,Bob Badour,PE,CA}  

and poses the question:

"What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada."

Well, 840082470 is like a invoice number. Of itself it is a meaningless string of digits. If the system we are designing belongs to the domain of Canadian higher education then it is certainly acceptable as a candidate key. However, because it is a key apparently owned by an external central system (forgive me for not understanding the Canadian academic system), it is open to some of the objections to SSN as a primary key. We are reliant on that external system to ensure uniqueness, guarantee stability and verify identification.

As for 745 versus PE,CA, that is clearly wrong. The Canadian postal abbreviation for "Prince Edward Island" and the ISO digraph for "Canada" identify two distinct pieces of information and derive from different sources, so they should be represented as two separate columns. But let us focus on whether 745 or PE makes the better primary key.

First thing, the database doesn't care which data type we use for the code to represent "Prince Edward Island". It just wants guaranteed uniqueness.

Second thing, the user-facing part of the system is likely to display the full expansion "Prince Edward Island", in which case the application is going to need to execute a look-up anyway. This is because users of a system which also holds addresses from the country of Peru or the state of California will appreciate the clarity of the expanded names[1]. Certainly if we go beyond the few hard cases (such as state abbreviations) the application should always expand codes when displaying them to users.

Thus the only advantage of using PE rather than 745 is that it makes ad hoc querying easier.

Third thing, if the code expansion changes we might want to distinguish records which use the newer version. This is a lot easier if 745='Prince Edward Island' and 746='Prince Edward Is.' than if we use PE as the primary key.

Fourth thing, there are programming considerations. For instance, if the application developers have to provide drop-down lists using Java Enumerations they need numeric codes.

In short, familiarity of natural keys is not as useful as the practicality of surrogate keys.

[1] Canadians will know that CA stands for Canada. But does MO stand for Morocco, Monaco, Moldova, Montenegro, Mongolia or Montserrat? Actually none of them: it's Macau.

APC
I understand how keys relate to logical data models. I don't see what they have to do with the physical at all.
bbadour
@bbadour - I have expanded my answer. My original response was a placeholder as your question had four close votes, and indeed attracted the fifth while I was editing.
APC
Fair enough; however, I still don't see what keys have to do with the physical at all. It seems you include "unique identification", "unvarying", "not compound". Not the words I would use, but certainly 3 design criteria. I am somewhat amused that you equate drawing ER diagrams with logical modeling. You seem to say "physicalizing" involves both primary and foreign keys. I don't see what either have to do with the physical.
bbadour
I am amused because ERM is a methodology for **conceptual analysis** as is NIAM, for example. ERD is a graphical language used for conceptual analysis, as is ORM, for example. None of that really has anything to do with **logical design**; although, tools for ERD and ORM often blur things by offering features to automate large parts of logical design. I downvoted your answer for confusing some of the most fundamental vocabulary in IT as reflected in the ISO/IEC 2382 standard vocabularies.
bbadour
You fixed some of the most egregious bits so I removed my downvote. Thank you!
bbadour
It's rare, but there have also been instances of duplicate SSN's (I suppose the social security administration fixes it at some point though)
Wade Williams
Even without the IRS assigning duplicate SSN's, one must expect to encounter them due to illegal immigrants making up SSN's to file their taxes. Being undocumented means getting assigned a date for a deportation hearing and released. Tax evasion gets prison time.
bbadour
I'd upvote the answer if SSN weren't used as an example -- it's a terrible example of a key, because of fraudulent duplication, reassignments, mistaken duplication, and the fact that not everybody has one!
Dave Markle
@DaveMarkie - In my answer I say that that SSN is not suitable as a primary key, although I admit the reasons I give are not as exhaustive as your list.
APC
+2  A: 

Hey. it's open again. Here goes.

(1) Choose good candidate keys.

It does not pertain to the database designer to choose candidate keys. The database designer has the responsibility to see to it that all the uniqueness requirements he is informed of by the user, will be enforced. So it is the user who "chooses" what the candidate keys are.

There are two scenario's I can think of that relax this unequivocal position a bit.

One is if the user says that some attribute of type 'video' or 'audio' (or some such) is to be unique. It may be infeasible to actually enforce that, and it is the designer's responsibility to point that out to the user (as it is also his responsibility to point out that 'uniqueness' of audio and video content is a very debatable subject, and that any uniqueness on such attribute values, even if enforcible by the system, still has a good chance of not being the same uniqueness that the user wants).

Second is how the picture gets muddied by the possibility of distinct logical designs all addressing the same problem. If D1 and D2 are both valid designs addressing the same problem, then it might be the case that a certain given uniqueness rule imposed by the user, is enforcible using keys in D1, but not in D2. From this perspective, "choosing candidate keys" can be interpreted as "choosing a particular design such that a given uniqueness rule is enforcible using keys". But that wasn't really the question that you asked.

(2) Choose good primary keys.

A while ago, Darwen launched the question "What are good reasons to single out one particular candidate from among the others as being 'primary' ?". Nothing much came out, except then perhaps : "to suggest that this particular key is the preferred one to use whenever making references to this relvar". I suspect they didn't find that convincing enough to change their earlier decision that "no key is more unique than any other".

But, supposing that nonetheless there exists some valid reason to single out one particular key as "primary", I suppose the following considerations apply :

  • the likeliness, or appropriateness, of using this primary key also as, e.g., the clustering key in the physical design.
  • and as a consequence of that, the probability of having to change a value of some existing primary key. Key values that are highly stable will be preferable over key values that are more volatile.
  • the percentage of the business that naturally uses some such key in their daily operations.
  • if the required space for physically encoding key values is significantly different, which one has the smallest encoding size.
Erwin Smout
I prefer to see an answer that more clearly separates the physical considerations. Keys are logical constructs. Insofar as some product limitations might force one to choose different keys to implement partitioning, clustering or indexing, I prefer an answer that makes clear those decisions relate specifically to working around product limitations--not good design.
bbadour
I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that.
bbadour
A: 
Joshua
@Joshua: I cannot make any sense of your answer at all. Range queries don't seem to have anything at all to do with keys. Did you mean intervals? Intervals and the sorts of problems identified and solved in Date, Darwen and Lorentzos' book on Temporal Databases?
bbadour
Intervals are a type of range queries.
Joshua
@Joshua - You seem to be conflating "Primary Key" with "Clustered Index". In Microsoft SQL Server it is true that PKs will default to being the clustered index if you don't specify any different but they are entirely different concepts.
Martin Smith
@Joshua: I still cannot make much sense of your answer. It seems you are saying that whatever foreign key reference is most queried defines the primary key. However, during design, one can decide which of several candidate keys to propagate to another relation for the purposes of referencing.
bbadour
@Martin: no I am not. Just try to declare a primary key and clustered index that differ.
Joshua
@bbadour: you are aware that there is actually no requirement that every foreign key relationship use the same key of the potential candidate keys right?
Joshua
@Joshua - That is perfectly possible (I'm still assuming you are talking about SQL Server. Syntax Follows) `create table #X (col1 int not null primary key nonclustered,col2 int) create clustered index ix ON #X (col2)`
Martin Smith
@Joshua: you are misinformed. Keys and indexes are completely different concepts. Keys and *clustered* indexes even more so. I don't know of any DBMS that requires a primary key and a "clustered index" to refer to the same set of columns - certainly not SQL Server. However, the topic is keys generally, it's not a question about the limitations of some particular software so your answer is irrelevant.
dportas
ah "primary key nonclustered" -- never heard of that before.
Joshua
Yes, Joshua, I am aware of that. I am not aware of any difference that makes to whether designers decide which candidate keys to use for different references. Now that you have heard of a nonclustered primary key, might you consider changing your answer or deleting it?
bbadour
Joshua: in my opinion, you underestimate the importance of primary keys in regular databases if you design them based on some "dominant query".
Stefan Steinegger
The day should come when I write my own book on databases. I have reason to believe the theory as taught in the classroom is actually wrong when it comes to practical applications. I appear to have underestimated the value of primary keys. I have simply transferred that value to candidate keys.
Joshua
@Joshua: A primary key is a candidate key. I don't know how you "underestimate" the value of them. Candidate keys are essential to getting correct results from relational database design. The topic of this question is design theory and that is not invalidated by the limitations of indexes in one or other versions of one particular DBMS. If the limitations of your DBMS software make some aspect of good design seem inconvenient to you then maybe you should use some better software rather than rewrite the textbooks to suit it.
dportas
The difference between primary key and candidate key is there can be only one primary key.
Joshua
+3  A: 

Your answer to Erwin: "I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that. – bbadour 8 hours ago "

Your original question was about 'primary keys'. Now you change your focus to keys and foreign keys. A key is an integrity constraint, so the only criteria are that a minimal set of attributes has to be unique in a relation (uniqueness and irreducibility). If we change our focus to foreign keys then simplicity, stability and familiarity are the criteria to choose from all the candidate keys in de referenced relation. There could be more candidate keys that fulfill that criteria to more or less the same extend. If we look at familiarity, one candidate key could be very familiar to a group of users and not to another group for which another candidate key is more familiar. Think about different views or subschemas of a database. This second group of users should choose a different candidate key for reference purposes (as foreign key). If you insist in 'primary keys' of which we only have one per relation then I have to ask what makes a key more primary than others. I think the term primary key should not be used. At least at the logical level. Also the term 'foreign keys' is not well chosen (foreign keys are not keys, but references).

So, I think the remarks of Erwin about ‘primary’ keys were very much to the point. Or at least this was my interpretation of what he means.

Do you agree with this? If so, would you change your original question to "What are the design criteria for keys and what are the criteria to choose a foreign key from the available candidate keys?"? If not, why?

Regards, Carlos

Carlos M. Calvelo
Carlos, my original question was about database design and emphasized that fact. Further, I clearly tagged the question as being about foreign keys right from the start as one cannot discuss the topic of choosing good primary and candidate keys without involving foreign keys. Others have removed the emphasis and altered the tags. I merely clarified the role of foreign keys in my question when I put the tag back so that the next passerby won't remove it again.
bbadour
Both Erwin's answer and your answer are good enough to get a vote from me, but neither answers the question directly enough, clearly enough or elaborated enough to warrant the bounty. While your answer is the 1st to fully enumerate the design criteria, it does so as an aside while nitpicking the question instead of answering it. No, I won't change the question for 2 important reasons: 1) brevity and 2) clarity.
bbadour
@bbadour: My answer was also about 'logical' database design. And I think we should not talk about foreign keys referencing primary keys but candidate keys. I think I was clear about that. As an aside: candidate to what? to become primary? Lets drop the ‘candidate’ thing too: simply ‘keys’. I had to ask what makes a primary key more primary than others. I also explained that the criteria are different for determining if a set of attributes should be a key and for choosing a foreign key. Bottom line: talking about 'primary' keys is problematic.
Carlos M. Calvelo
@bbadour: Even if you don’t change your question, would you agree that my reformulation (trying to forget about ‘primary’ keys) of your question is less problematic? As an aside: It was not my intention to give an elaborated answer (I’m not doing homework here) or to get a bounty. If I understand and I’m being understood I’m happy.
Carlos M. Calvelo
No, I would not agree. A candidate key is a candidate for referencing. A primary key provides guidance on how best to reference a relation. I don't find my question problematic. I would hope that anyone motivated to answer a question would be motivated to provide their best answer regardless of incentive. ::shrug::
bbadour
@bbadour: I was wrongly thinking the terms candidate and primary were being used in a more formal and restrictive sense, but they are only for pragmatic guidance in design. I understand now why you would not change your question. Thank you for the clarifications.
Carlos M. Calvelo
+2  A: 

A primary key is a candidate key chosen for special treatment, so first we must look at the properties of candidate keys. A set of one or more columns is a candidate key if it has the following two properties:

Uniqueness: A candidate key must uniquely identify each row in a table. No table may contain two rows with the same value for the candidate key.

Irreducability: Removing any column from a candidate key must violate the uniqness property. In other words, no subset of columns in a candidate key is itself a candidate key.

If no candidate key exists, and sometimes even if one does, a surrogate key is often created using an auto-incrementing integer column, or made up using some other technique. This surrogate key is now also a candidate key.

It is often useful to choose among the available candidate keys and to designate one of them as the primary key. The first criteria often applied is simplicity indicating the candidate key with the fewest columns. However there are other potential criteria, like familiarity, familiar values being more useful than non-familiar values, and stability, stable keys being less troublesome than keys that are apt to change. These criteria however, are strictlty outside the scope the relational model, often conflict with each other, and are often made to deal with implementation limitations.

I would say that the first two concepts "uniqueness" and "irreducability" are less design criteria than fundamental properties of primary keys, while the latter concepts of "simplicity", "familiarity" and "stability" are more properly labeled design criteria, as they involve tradeoffs and subjectivity.

Why choose a primary key? Simplicity and familiarity are not only criteria for choosing among available candididate keys, but are why we should choose a primary key at all. If there are are multiple candidate keys in a table, it simplifys things if all foreign keys pointing to that table refer to the same candidate key. Furthermore, the very act of choosing a particular candidate key will help make it familiar.

Paul Mansour
+1  A: 

(Not quite sure how to interpret this question. Sounds like a quiz or something where you are looking for one single "right" answer from a textbook. I'm going to interpret the question as a more practical one, hence my advice below.)

At least in the MS SQL world, discussion about a proper Primary Key is inevitably wrapped up in discussion about the proper clustered index for a table. The two don't have to be the same, but they are by default, and for many tables, making the two the same is often a good idea.

For the purpose of our discussion here, its important to distinguish between the two:

A PRIMARY KEY is a field or combination of fields that uniquely identify a row.
A CLUSTERED INDEX is a field or combination of fields that represents the physical ordering of a table. (Again, I am speaking about MS SQL Server, not sure how other RDBS might handle this)

Key to the remainder of my discussion is knowing that since SQL 7.0, the clustered index key is used as a row identifier for all non-clustered indexes. This means that many of the same criteria for choosing a good clustering key are the same as for choosing a good primary key.

Let's first look at the criteria for a good clustered index (From Kimberly Tripp's excellent article). A clustered index should be:

  1. Unique - otherwise useless as a row identifier for other indexes
  2. Narrow - this key is used in other indexes, so should be as narrow as possible
  3. Static - If key values change, then references become invalid and will need updating
  4. Ever-increasing - To reduce physical table fragmentation as new rows are added

It is readily apparent the first 3 are also good criteria for a primary key. #4 is a bonus that will reduce table fragmentation as tables grow.

A GUID as a primary key, as popular as that is, actually fails 2 of these criteria (Narrow and Ever-Increasing). As such, it is not recommended as a PK/Clustered index in most circumstances (see Kim's related article here)

BradC
Thank you for your answer Brad. My question is not product-specific. For your benefit, I have asked a new question [How does the primary key keyword relate to clustered indexes in MS SqlServer?](http://stackoverflow.com/questions/3662163/how-does-the-primary-key-keyword-related-to-clustered-indexes-in-ms-sqlserver) Here, I can give your answer a vote. There, I can pick it as the best answer. If you would be so kind as to copy and paste it there, you would serve this site very well. Thank you!
bbadour
+2  A: 

What are the criteria?

A PRIMARY KEY is something that will define the entity, only the entity and nothing but the entity.

  • You can take it from the outside world. Say, a star catalog number to identify a star (good example), or an SSN to identify a person (bad example).

    In this case, you rely on the outside world.

    • Do all people have SSN? (They don't).
    • Are SSN's unique? (They aren't).
    • Can an SSN be assigned to another person? (It can).
  • You can generate it inside your model, using AUTOINCREMENT or GUIDs or whatever.

    In this case, you rely on yourself and your database skills.

    • Do all people in your model have an ID? (Yes, they do, otherwise they wouldn't be in the table with ID NOT NULL).
    • Are these ID's unique? (Yes, they are, the PRIMARY KEY constraint takes care of it).
    • Can they be assigned to other persons? (No, they cannot, they are either non-repeatable by design or auto incrementing).

    Or another set of answers:

    • Do all people in your model have an ID? (No, they don't, the people table was accidentally dropped, though some other information retained).
    • Are these ID's unique? (No, we failed to merge two versions of the database properly).
    • Can they be assigned to other persons? (Yes, we reset the AUTOINCREMENT by mistake).

The most important thing is that a surrogate key is a feast that is always with you. You can always create a surrogate key: nothing on Earth can stop you from declaring an AUTOINCREMENT field. But by far not all things have some kind of identifier everybody agrees upon.

However, a good natural key cannot be overemphasized.

Guide Star Catalog database is most probably backed up more reliably than yours, and the list of US state codes you always can restore right from the memory.

Quassnoi
+3  A: 

A candidate key is a set of attributes that are irreducibly unique (irreducible meaning that no attribute can be removed from the key without losing the uniqueness property).

Other criteria when choosing what candidate keys to implement are: simplicity, stability, familiarity.

These three criteria are important considerations but not necessarily essential attributes of a key. For instance it may be desirable and quite reasonable to enforce a key that can change often. e.g.: a user login name is required to be unique but the user may change it at will as long as it remains unique.

A primary key is a candidate key.

dportas
+4  A: 

The criteria for consideration of a primary key are:

  • Uniqueness
  • Irreducibility (no subset of the key uniquely identifies a row in the table)
  • Simplicity (so that relational representation & manipulation can be simpler)
  • Stability (should not be altered frequently)
  • Familiarity (meaningful to the user)
BradC
I was going to write up a more complete answer, but some things came up keeping me from doing so. In the end, I decided to award the bounty to Brad for persistence, clarity and brevity.
bbadour