views:

738

answers:

10

This seems like a duplicate even as I ask it, but I searched and didn't find it. It seems like a good question for SO -- even though I'm sure I can find it on many blogs etc. out there. SO will have more of a debate than you can get on a blog.

I'm running into an issue with a join: getting back too many records. I think of this as "expansion". I added a table to the set of joins and the number of rows expanded, way too much. Usually when this happens I add a select of all the ID fields that are involved in the join. That way it's pretty obvious where the expansion is happening and I can change the ON of the join to fix it. Except in this case, the table that I added doesn't have an ID field. To me, this is a problem. But perhaps I'm wrong.

The question: should every table in a database have an IDENTITY field that's used as the PK? Are there any drawbacks to having an ID field in every table? What if you're reasonably sure this table will never be used in a PK/FK relationship?

Related, but not duplicate: http://stackoverflow.com/questions/932913/when-having-an-identity-column-is-not-a-good-idea

Apparently this debate has been going on for a while. Shoulda known.

This post (surrogate vs. natural keys) is also relevant.

+7  A: 

Almost always yes. I generally default to including an identity field unless there's a compelling reason not to. I rarely encounter such reasons, and the cost of the identity field is minimal, so generally I include.

Only thing I can think of off the top of my head where I didn't was a highly specialized database that was being used more as a datastore than a relational database where the DBMS was being used for nearly every feature except significant relational modelling. (It was a high volume, high turnover data buffer thing.)

Greg D
I think it's probably good to note that this is my personal approach (obviously) and I'm not necessarily claiming that it's the _right_ approach. There could quite possible be better philosophies out there.
Greg D
Use an artificial key (identity in SQL Server parlance) as the pk if a natural key is not forthcoming.
OMG Ponies
I've immensely regretted EVERY time I tried to shortcut and avoid having an arbitrary primary key. I'm certain that I will include a non-domain-driven primary key in EVERY database I design from here on...
Will Shaver
@Will: I think that this comment deserves an answer of its own; maybe discuss your opinion of non-domain-driven primary keys
jcollum
+1  A: 

I'd say, if you can find a simple, natural key in your table (i.e. one column), use that as a key instead of an identity column.

I generally give every table some kind of unique identifier, whether it is natural or generated, because then I am guaranteed that every row is uniquely identified somehow.

Personally, I avoid IDENTITY (incrementing identity columns, like 1, 2, 3, 4) columns like the plague. They cause a lot of hassle, especially if you delete rows from that table. I use generated uniqueidentifiers instead if there is no natural key in the table.

Anyway, no idea if this is the accepted practice, just seems right to me. YMMV.

Matthew Jones
What's the downside of having holes in your set of identity values (when a row is deleted)? I don't see one.
jcollum
Not a whole lot. The problem can occur if you need to keep the incremental items in order (like using the IDENTITY column as both identifier and ordering). Good design, however, minimizes the problems. I just prefer uniqueidentifiers.
Matthew Jones
You are aware of the massive performance drawbacks of GUIDs as clustered key in a SQL Server scenario, right?
marc_s
+4  A: 

Every table should have some set of field(s) that uniquely identify it. Whether or not there is a numeric identifier field separate from the data fields will depend on the domain you are attempting to model. Not all data easily falls into the 'single numeric id' paradigm, and as such it would be inappropriate to force it. Given that, a lot of data does easily fit in this paradigm and as such would call for such an identifier. There is no one answer to always do X in any programming environment, and this is another example.

Mark Roddy
Give an example of data that can't be modeled as a table that has a unique PK identity with a unique index on the set of fields that make the entry unique vis a vis the domain. I can't think of any. Wow I hope that makes sense; a little sleep deprived here.
jcollum
@jcollum I wasn't trying to say that it _can't_ be done, but that there are cases where it doesn't necessarily make the most sense to do so. I'm sure that for every case data could be modeled with a single identifier primary key, but in some edge cases, not using a numeric identifier primary case makes more sense. The point I would want to stress is that even though the identity approach makes sense in the majority of cases there are exceptions where this is not the case. As an example, last year I worked on a db with deep hierarchical data where the leaf items had large (multi-gig) tables
Mark Roddy
OK, after reading what you said I may rephrase the question... this sort of thing is so rarely binary yes/no that it seems dumb to ask the question that way
jcollum
Agreed, very few design decisions are binary, but there's value in determining what the common case is while acknowledging the edge cases as well.
Mark Roddy
A: 

I can't think of any drawback about having an ID field in each table. Providing your the type of your ID field provides enough space for your table to grow.

However, you don't necessarily need a single field to ensure the identity of your rows. So no, a single ID field is not mandatory.

Primary and Foreign Keys can consist not only of one field, but of multiple fields. This is typical for tables implementing a N-N relationship.

You can perfectly have PRIMARY KEY (fa, fb) on your table:

CREATE TABLE t(fa INT , fb INT);
ALTER TABLE t ADD PRIMARY KEY(fa , fb);
jeje
+2  A: 

Yes, for the vast majority of cases.

Edge cases or exceptions might be things like:

  • two-way join tables to model m:n relationships
  • temporary tables used for bulk-inserting huge amounts of data

But other than that, I think there is no good reason against having a primary key to uniquely identify each row in a table, and in my opinion, using an IDENTITY field is one of the best choices (I prefer surrogate keys over natural keys - they're more reliable, stable, never changing etc.).

Marc

marc_s
can you explain the downvote?? What's wrong with my answer?
marc_s
i hear ya, hate it when people downvote without explaining why, but i've gotten over it now
jcollum
+12  A: 

There are two concepts that are close but should not be confused: IDENTITY and PRIMARY KEY

Every table (except for the rare conditions) should have a PRIMARY KEY, that is a value or a set of values that uniquely identify a row.

See here for discussion why.

IDENTITY is a property of a column in SQL Server which means that the column will be filled automatically with incrementing values.

Due to the nature of this property, the values of this column are inherently UNIQUE.

However, no UNIQUE constraint or UNIQUE index is automatically created on IDENTITY column, and after issuing SET IDENTITY_INSERT ON it's possible to insert duplicate values into an IDENTITY column, unless it had been explicity UNIQUE constrained.

The IDENTITY column should not necessarily be a PRIMARY KEY, but most often it's used to fill the surrogate PRIMARY KEYs

It may or may not be useful in any particular case.

Therefore, the answer to your question:

The question: should every table in a database have an IDENTITY field that's used as the PK?

is this:

No. There are cases when a database table should NOT have an IDENTITY field as a PRIMARY KEY.

Three cases come into my mind when it's not the best idea to have an IDENTITY as a PRIMARY KEY:

  • If your PRIMARY KEY is composite (like in many-to-many link tables)
  • If your PRIMARY KEY is natural (like, a state code)
  • If your PRIMARY KEY should be unique across databases (in this case you use GUID / UUID / NEWID)

All these cases imply the following condition:

You shouldn't have IDENTITY when you care for the values of your PRIMARY KEY and explicitly insert them into your table.

Update:

Many-to-many link tables should have the pair of id's to the table they link as the composite key.

It's a natural composite key which you already have to use (and make UNIQUE), so there is no point to generate a surrogate key for this.

I don't see why would you want to reference a many-to-many link table from any other table except the tables they link, but let's assume you have such a need.

In this case, you just reference the link table by the composite key.

This query:

CREATE TABLE a (id, data)
CREATE TABLE b (id, data)
CREATE TABLE ab (a_id, b_id, PRIMARY KEY (a_id, b_id))
CREATE TABLE business_rule (id, a_id, b_id, FOREIGN KEY (a_id, b_id) REFERENCES ab)

SELECT  *
FROM    business_rule br
JOIN    a
ON      a.id = br.a_id

is much more efficient than this one:

CREATE TABLE a (id, data)
CREATE TABLE b (id, data)
CREATE TABLE ab (id, a_id, b_id, PRIMARY KEY (id), UNIQUE KEY (a_id, b_id))
CREATE TABLE business_rule (id, ab_id, FOREIGN KEY (ab_id) REFERENCES ab)

SELECT  *
FROM    business_rule br
JOIN    a_to_b ab
ON      br.ab_id = ab.id
JOIN    a
ON      a.id = ab.a_id

, for obvious reasons.

Quassnoi
I think you make a good point.
jcollum
On second thought I don't think you really answered the question as a yes/no. I don't care for this idea of using a composite key either. I know it happens in practice, but when building a database I'd prefer to have one field that is a unique PK (and an Identity field makes it easy). That way a change to your schema means only changing the unique index on one table instead of every place that that key is used.
jcollum
Also, I edited the question to clear up confusion: I do understand the difference between Unique, PK and Identity.
jcollum
@jcollum: I cannot give a single `yes/no` answer for all cases. `IDENTITY` is an option in `CREATE TABLE`, which means there are cases you do want to use it and cases when you don't want. I try to explain these cases. Could you please tell what exactly didn't you like in these explanations so that you downvoted them?
Quassnoi
@jcollum: as for the composite keys, there are cases when you just *have* to use them. A many-to-many link table is a classical example of such a case.
Quassnoi
I still disagree. I'd prefer to have a many to many table look like ID, FK1, FK2 and have a unique index on FK1/FK2. If I ever want to use the many to many relationship ID in another table it'll be much easier to carry around the ID field than the two FKs. I have to think that the performance would be better in such a situation as well.
jcollum
This one does seem to be getting more votes though, so even if I disagree I'll let the community decide which answer is the right one.
jcollum
@jcollum: what's the point of keeping `id` when your `FK1/FK2` uniquely define the row? Could you please provide a real world scenario which would use this `id` somehow?
Quassnoi
Scenario: I create a table (TableA) and six months later would like to reference a row in that table in a one-to-many relationship (TableB). I then have the choice of carrying the n number of columns in the composite key from TableA over to TableB or carrying the single int of the PK/Identity field. I'd really prefer to have a single int field to reference in that case. So I think that for the sake of expansion and refactoring I prefer the PK Identity model. I do see your point, I'm just not convinced yet.
jcollum
I also think that the identity approach is superior when it comes to refactoring. Scenario: I create a table that has 3 columns in it's PK. I then use that PK in 4 other tables. Then six months later I need to add/remove a column in the PK. That means refactoring 5 tables total. If I'd used an identity PK then I would only have to refactor one table.
jcollum
@jcollum: as for the first scenario: I asked about referencing a `many-to-many` link table from somewhere outside the tables it links. why would you want to do that?
Quassnoi
@jcollum: Changing the values (not speak of layout) of `PRIMARY KEY`'s is a sign of a bad design. The whole point of a `PRIMARY KEY` is that it never changes. It's what identifies the row. Your `PRIMARY KEY` is either natural (which means that the it never changes **by its nature**), or surrogate (which means you don't ever want to change it since it had no meaning from the begin with). A `many-to-many` link table is an example of a **natural** `PRIMARY KEY`. It's as stable as other (possibly surrrogate) `PRIMARY KEY`s are.
Quassnoi
I didn't say anything about changing values of keys, I was talking about changing the structure of a composite key.
jcollum
Calling a change in table structure a bad design is silly. Business rules change as projects go on.
jcollum
@jcollum: *sigh* OK.
Quassnoi
@Quassnoi: "sigh"? Really? When people respond with "sigh" I'm done talking to them.
jcollum
@jcollum: sorry for a "sigh", but you don't really seem to be talking with me anyway. I asked you this question: *referencing a many-to-many link table from somewhere outside the tables it links. Why would you want to do that?* **twice**, and you didn't care to respond.
Quassnoi
Asking why is pretty irrelevant. It does happen, in my experience. Why? Because the business rules said so.
jcollum
@jcollum: it would be really more simple to speak of some real world scenario than of some abstract spherical business rules in vacuum, but I'll still try. Your business rules involve selecting some data from the table being linked: `SELECT * FROM business_rule br JOIN a_to_b TABLE ON br.a_to_b_id = a_to_b.id JOIN a IN a_to_b_id_a = a.id`. Why having an extra join here? If you were referencing `a_to_b` by the composite key (which you have anyway), you could just issue this `SELECT * FROM business_rule br JOIN a ON br.a_id = a.id`, which is more readable, more efficient and more manageable.
Quassnoi
I think that we're talking about a very abstract concept here, so specifics merely cause distraction. Nevertheless, here goes. I have 2 tables: Wizzle, Wozzle. I join these together in WizzleWozzle, which has WizzleId and WozzleId. Let's assume that multiple Wizzles can be assigned to a Wozzle. Now if I want to use this relationship in other tables I can go surrogate: WizzleWozzleId or composite: WizzleId, WozzleId. My point is that the database is more adaptable if you go with the former. If I use the composite key in 6 tables and need to change the structure of the key (cont...)
jcollum
@Quassnoi: I think jcollum is describing an auxiliary table that references a_to_b and he does not want to use two columns to do that. There may indeed be scenarios where that makes sense, but I would certainly say they are few and far between--in the usual case, you are absolutely correct that a composite key on {a_id, b_id} is the appropriate model. jcollum's original question betrays him as having only a basic understanding of relational data modeling (which your answer does well to address) and so I suspect that he's not operating in one of the aforementioned scenarios.
samael
... from Wizzle/Wozzle to Wizzle/Wozzle/Wuzzle then I need to change 6 tables. Assuming that there's a business layer etc here then I've got a lot of code to change and test. If I'd gone with a single surrogate key I'd have far fewer changes to make: I'd only have to change the unique constraint on the WizzleWozzle table and add a column to that table.
jcollum
@samael: This debate is far from decided; in this thread: http://stackoverflow.com/questions/63090/surrogate-vs-natural-business-keys, most of the responders wanted to go with surrogate keys. I'm not prepared to call this settled.
jcollum
@samael: Also, the last database I was involved with building had some messy many to many relations. Like this: 5 tables, A...E. There was an AtoB table. An AtoBtoC table. Etc up to E. If we'd used composite keys it would've been a mess.
jcollum
@jcollum: so you are willing to guaranteedly sacrifice performance to disputabley ease the table redesign? Note that for `many-to-many` link tables, the composite key is always `2` columns wide. If you link `3` tables, you don't make one `many-to-many-to-many` table with a `PRIMARY KEY` or `3` columns: you make `2` `many-to-many` tables with `2` columns `PRIMARY KEY` in each. So I don't quite get on what "primary key redesign" you are talking.
Quassnoi
@jcollum: `A-to-B-to-C` is a classical example of `4NF` violation: http://en.wikipedia.org/wiki/Fourth_normal_form
Quassnoi
+4  A: 

I'm a firm beliver that natural keys are often far worse than articifal keys because you often have no control over whether they will change which can cause horrendous data integrity or performance problems.

However, there are some (very few) natural keys that make sense without being an identity field (two-letter state abbreviation comes to mind, it is extremely rare for these official type abbreviations to change.)

Any table which is a join table to model a many to many realtionship probably also does not need an addtional identity filed. Making the two key fields together the primary key will work just fine.

Other than that I would in general add an identity field to most other tables unless given a compelling reason in that particular case not to. It is a bad practice to fail to create a primary key on a table or if you are using surrogate keys to fail to place a unique index on the other fields needed to guarantee uniqueness where possible (unless you really enjoy resolving duplicates).

HLGEM
Amen! Never use a key field if there is any chance ever that it could change.
thursdaysgeek
Um, let me clarify that. Never define a field as a primary key if there is any chance that it could change. Using the state code for a primary key is probably ok, but don't use your customer ID. What happens when one buys out another and they merge?
thursdaysgeek
Yes there are far too many possible natural keys that change very easily and thus very few that actually are suitable for use as a primary key.
HLGEM
Assuming that *anything* will never change in programming is a Bad Idea.
jcollum
And you don't want to be changing keys when you have lots of records, I have a table with over 8 million records that is a primary key for 99 other tables. If I used a key that could change instead of a surrogate key, I would have the database locked up every day just propagating changes to the child tables from changes to parent table key. Change is something to be avoided at all costs in database primary keys if you want your system to continue to function. So considering how often the key could change is critical to database design.
HLGEM
+1  A: 

Recognize the distinction between an Identity field and a key... Every table should have a key, to eliminate the data corruption of inadvertently entering multiple rows that represent the same 'entity'. If the only key a table has is a meaningless surrogate key, then this function is effectively missing.

otoh, No table 'needs' an identity, and certainly not every table benefits from one... Examples are: A table with a short and functional key, a table which does not have any other table referencing it through a foreign Key, or a table which is in a one to zero-or-one relationship with another table... none of these need an Identity

Charles Bretana
A: 

If you have modelled, designed, normalised etc, then you will have no identity columns.

You will have identified natural and candidate keys for your tables.

You may decide on a surrogate key because of the physical architecture (eg narrow, numeric, strictly monotonically increasing), say, because using a nvarchar(100) column is not a good idea (still need unique constraint).

Or because of ideology: they appeal to OO developers I've found.

Ok, assume ID columns. As your db gets more complex, say several layers, how can you jon parent and grand-.child tables directly. You can't: you always need intermediate tables and well indexed PK-FL columns. With a composite key, it's all there for you...

Don't get me wrong: I use them. But I know why I use them...

Edit:

I'd be interested to collate "always ID"+"no stored procs" matches on one hand, with "use stored procs"+"IDs when they benefit" on the other...

gbn
A: 

No. Whenever you have a table with an artificial identity column, you also need to identify the natural primary key for the table and ensure that there is a unique constraint on that set of columns too so that you don't get two rows that are identical apart from the meaningless identity column by accident.

Adding an identity column is not cost free. There is an overhead in adding an unnecessary identity column to a table - typically 4 bytes per row of storage for the identity value, plus a whole extra index (which will probably weigh in at 8-12 bytes per row plus overhead). It also takes slightly to work out the most cost-effective query plan because there is an extra index per table. Granted, if the table is small and the machine is big, this overhead is not critical - but for the biggest systems, it matters.

Jonathan Leffler