views:

553

answers:

12

I know this is subjective, but I'd like to know peoples opinions and hopefully some best practices that I can apply when designing sql server table structures.

I personally feel that keying a table on a fixed (max) length varchar is a no-no, because it means having to also propogate the same fixed length across any other tables that use this as a foreign key. Using an int, would avoid having to apply the same length across the board, which is bound to lead to human error, i.e. 1 table has varchar (10), and the other varchar (20).
This sounds like a nightmare to initially setup, plus means future maintaining of the tables is cumbersome too. For example, say the keyed varchar column suddenly becomes 12 chars instead of 10. You now have to go and update all the other tables, which could be a huge task years down the line.

Am I wrong? Have I missed something here? I'd like to know what others think of this and if sticking with int for primary keys is the best way to avoid maintainace nightmares.

Thanks in advance.

+10  A: 

I would definitely recommend using an INT NOT NULL IDENTITY(1,1) field in each table as the primary key.

With an IDENTITY field, you can let the database handle all the details of making sure it's really unique and all, and the INT datatype is just 4 bytes, and fixed, so it's easier and more suited to be used for the primary (and clustering) key in your table.

And you're right - INT is an INT is an INT - it will not change its size of anything, so you won't have to ever go recreate and/or update your foreign key relations.

Using a VARCHAR(10) or (20) just uses up too much space - 10 or 20 bytes instead of 4, and what a lot of folks don't know - the clustering key value will be repeated on every single index entry on every single non-clustered index on the table, so potentially, you're wasting a lot of space (not just on disk - that's cheap - but also in SQL Server's main memory). Also, since it's variable (might be 4, might be 20 chars) it's harder to SQL server to properly maintain a good index structure.

Marc

marc_s
And your reasons for not using varchar? same as mine or other reasons?
HAdes
Either that or a UNIQUEIDENTIFIER. A guid makes a good key.
Max Schmeling
@HAdes human error is always possible. My reason would be just because nobody does it that way, it smells, and it probably isn't as efficient but I don't know that for sure.
Max Schmeling
@Max: If you want to use a GUID instead - use it for the PRIMARY KEY, but NEVER EVER use it as the clustering key on a SQL Server table. It's bad bad bad - awful performance, very bad index fragmentation.
marc_s
@marc_s that's what newsequentialid is for (where available)
Michael Haren
@Michael: even newsequentialguid is just a hack - it's still causing LOTS of index fragmentation - and it's just cumbersome to use - SELECT * FROM TABLE WHERE OID = 'ab3-123-123ba-09213' - I just can't remember stuff like that.......
marc_s
I'd also add that you should use *signed* ints--not unsigned. Unsigned types give you double the range but I don't think they're worth the hassle of coercing your data layer, tools, etc. every step of the way. Plus, hardly anyone does this
Michael Haren
Plus, the trouble with GUIDs as PK is that in most cases they're being used, it's because the dev wants to be able to set the GUID on the client - not let the DB handle the work --> hardly anyone I know ever really uses newsequentialid() :-(
marc_s
@marc_s--good points
Michael Haren
@marc_s - IMO, it is a *good* thing that writing a query with a guid is more cumbersome. It lessens the odds that a developer will show it to a user. Developers should never show a surrogate key to a user. In my experience, ints prove to be too tempting to show to users.
Thomas
@marc_s -The other argument against an int is any sort of database consolidation. Suppose you are building a product used by multiple customers each with their own database and they want to consolidate. Trying to consolidate databases with identity columns is the 10th level of hell.
Thomas
@Thomas: yes, consolidation with INT IDENTITY is cumbersome - but managable. But really: how many times do you have to deal with this?? Compare that to having lousy query performance (using GUIDs as clustering key) on **EVERY** query you send to that table.....
marc_s
@marc_s - I would not categorize consolidations with identity values as manageable. They are a royal nightmare and I've had to do them more often than I wish to remember. Right now, I work with a product system with a db backend so I do consolidations and imports from external systems more often than the average bear. In my exp, the difference in performance between guids and ints is frankly not significant when you use a COMB guid.
Thomas
+3  A: 

Like you said, consistency is key. I personally use unsigned ints. You're not going to run out of them unless you are working with ludicrous amounts of data, and you can always know any key column needs to be that type and you never have to go looking for the right value for individual columns.

CaptnCraig
A: 

While INT is generally recommended, it really depends on your situation.

If you're concerned with maintainability, then other types are just as feasible. For example, you could use a Guid very effectively as a primary key. There's reasons for not doing this, but consistency is not one of them.

But yes, unless you have a good reason not to, an int is the simplest to use, and the least likely to cause you any problems.

Nader Shirazie
I agree that is greatly depends on your situation. In some cases I think varchar is more useful than an int because it increases readability. As always there is a performance trade off but just with coding - you have to decide, performance or read ability
Dan
@Dan - I don't buy into this 'readability' argument. isn't that what views are for?
HAdes
+3  A: 

I'd agree that in general an INT (or identity) field type is the best choice in most "normal" database designs:

  • it requires no "algorithm" to generate the id/key/value
  • you have fast(er) joins and the optimizer can work a lot harder over ranges and such under the hood
  • you're following a defacto standard

That said, you also need to know your data. If you're going to blow through a signed 32-bit int, you need to think about unsigned. If you're going to blow through that, maybe 64-bit ints are what you want. Or maybe you need a UUID/hash to make syncing between database instances/shards easier.

Unfortunately, it depends and YMMV but I'd definitely use an int/identity unless you have a good reason not to.

John Paul Ashenfelter
A: 

With PostgreSQL I generally use the "Serial" or "BigSerial" 'data type' for generating primary keys. The values are auto incremented and I always find integers to be easy to work with. They are essentially equivalent to a MySQL integer field that is set to "auto_increment".

Nicholas Kreidberg
+1  A: 

For best performance, 99.999% of the time the primary key should be a single integer field.

Unless you require the primary key to be unique across multiple tables in a database or across multiple databases. I am assuming that you are asking about MS SQL-Server because that is how your question was tagged. In which case, consider using the GUID field instead. Though better than a varchar, the GUID field performance is not as good as an integer.

Glenn
+1  A: 

Use INT. Your points are all valid; I would prioritize as:

  1. Ease of using SQL auto increment capabiity - why reinvent the wheel?
  2. Managability - you don't want to have to change the key field.
  3. Performance
  4. Disk Space

1 & 2 require the developer's time/energy/effort. 3 & 4 you can throw hardware at.

Jeff O
A: 

One should think hard about whether 32-bit range is enough for what you're doing. Twitter's status IDs were 32-bit INTs and they had trouble when they ran out.

Whether to use a BIGINT or a UUID/GUID in that situation is debatable and I'm not a hardcore database guy, but UUIDs can be stored in a fixed-length VARCHAR without worrying that you'll need to change the field size.

Alan Smithee
good point - if 2 billion rows might be tight, use BIGINT instead :-) That should keep even hard-core twitterers happy for a while :-)
marc_s
A: 

We have to keep in mind that the primary key of a table should not have "business logic" and it should be only an identity of the record it belongs. Following this simple rule an int and especially an identity int is a very good solution. By asking about varchar I guess that you mean using for example the "Full Name" as a key to the "people" table. But what if we want to change the name from "George Something" to "George A. Something" ? And what size will the field be ? If we change the size we have to change the size on all foreign tables too. So we should avoid logic on keys. Sometimes we can use the social ID (integer value) as key but I avoid that too. Now if a project has the prospects to scale up you should consider using Guids too (uniqueidentifier SQL type).

George Statis
+10  A: 

When choosing the primary key usualy you also choose the clustered key. Them two are often confused, but you have to understand the difference.

Primary keys are logical business elements. The primary key is used by your application to identify an entity, and the discussion about primary keys is largely wether to use natural keys or surrogate key. The links go into much more detail, but the basic idea is that natural keys are derived from an existing entity property like ssn or phone number, while surrogate keys have no meaning whatsoever with regard to the business entity, like id or rowid and they are usually of type IDENTITY or some sort of uuid. My personal opinion is that surrogate keys are superior to natural keys, and the choice should be always identity values for local only applicaitons, guids for any sort of distributed data. A primary key never changes during the lifetime of the entity.

Clustered keys are the key that defines the physical storage of rows in the table. Most times they overlap with the primary key (the logical entity identifier), bu that is not actually enforced nor required. When the two are different it means there is a non-clustered unique index on the table that implements the primary key. Clustered key values can actualy change during the lifetime of the row, resulting in the row being physically moved in the table to a new location. If you have to separate the primary key from the clustered key (and sometimes you do), choosing a good clustered key is significantly harder than choosing a primary key. There are two primary factors that drive your clustered key design:

  1. The prevalent data access pattern.
  2. The storage considerations.

Data Access Pattern. By this I understand the way the table is queried and updated. Remember that clustered keys determine the actual order of the rows in the table. For certain access patterns, some layouts make all the difference in the world in regard to query speed or to update concurency:

  • current vs. archive data. In many applications the data belonging to the current month is frequently accessed, while the one in the past is seldom accessed. In such cases the table design uses table partitioning by transaction date, often times using a sliding window algorithm. The current month partition is kept on filegroup located a hot fast disk, the archived old data is moved to filegroups hosted on cheaper but slower storage. Obviously in this case the clustered key (date) is not the primary key (transaction id). The separation of the two is driven by the scale requirements, as the query optimizer will be able to detect that the queries are only interested in the current partition and not even look at the historic ones.

  • FIFO queue style processing. In this case the table has two hot spots: the tail where inserts occur (enqueue), and the head where deletes occur (dequeue). The clustered key has to take this into account and organize the table as to physically separate the tail and head location on disk, in order to allow for concurency between enqueue and dequeue, eg. by using an enqueue order key. In pure queues this clustered key is the only key, since there is no primary key on the table (it contains messages, not entities). But most times the queue is not pure, it also acts as the storage for the entities, and the line between the queue and the table is blured. In this case there is also a primary key, which cannot be the clustered key: entities may be re-enqueued, thus changing the enqueue order clustered key value, but they cannot change the primary key value. Failure to see the separation is the primary reason why user table backed queues are so notoriously hard to get right and riddled with deadlocks: because the enqueue and dequeue occur interleaved trought the table, instead of localized at the tail and the head of the queue.

  • Correlated processing. When the application is well designed it will partition processing of correlated items between its worker threads. For instance a processor is designed to have 8 worker thread (say to match the 8 CPUs on the server) so the processors partition the data amongst themselves, eg. worker 1 picks up only accounts named A to E, worker 2 F to J etc. In such cases the table should be actually clustered by the account name (or by a composite key that has the leftmost position the first letter of account name), so that workers localize their queries and updates in the table. Such a table would have 8 distinct hot spots, around the area each worker concentrates at the moment, but the important thing is that they don't overlap (no blocking). This kind of design is prevalent on high throughput OLTP designs and in TPCC benchmark loads, where this kind of partitioning also reflects in the memory location of the pages loaded in the buffer pool (NUMA locality), but I digress.

Storage Considerations. The clustered key width has huge repercursions in the storage of the table. For one the key occupies space in every non-leaf page of the b-tree, so a large key will occupy more space. Second, and often more important, is that the clustered key is used as the lookup key by every non-clustred key, so every non-clustered key will have to store the full width of the clustered key for each row. This is what makes large clustered keys like varchar(256) and guids poor choices for clustered index keys.
Also the choice of the key has impact on the clustered index fragmentation, sometimes drastically affecting performance.

These two forces can sometimes be antagonistic, the data access pattern requiring a certain large clustered key which will cause storage problems. In such cases of course a balance is needed, but there is no magic formula. You measure and you test to get to the sweet spot.

So what do we make from all this? Always start with considering clustered key that is also the primary key of the form entity_id IDENTITY(1,1) NOT NULL. Separate the two and organize the table accordingly (eg. partition by date) when appropiate.

Remus Rusanu
+1 excellent explanation, Remus!
marc_s
"A primary key never changes during the lifetime of the entity." -- Why not?
pst
@pst: because other systems use the primary key to identify the entity. If the PK changes, their reference gets lost or, worse, refers to a different entity. Same reason why you cannot change the address on an object in memory: all pointers to it become invalid.
Remus Rusanu
A: 

Based on going through this exercise countless times and then supporting the system with the results, there are some caveats to the blanket statement that INT is always better. In general, unless there is a reason, I would go along with that. However, in the trenches, here are some pros and cons.

INT

  • Use unless good reason not to do so.

GUID

  • Uniqueness - One example is the case where there is one way communication between remote pieces of the program and the side that needs to initiate is not the side with the database. In that case, setting a Guid on the remote side is safe where selecting an INT is not.
  • Uniqueness Again - A more far fetched scenario is a system where multiple customers are coexisting in separate databases and there is migration between customers like similar users using a suite of programs. If that user signs up for another program, their user record can be used there without conflict. Another scenario is if customers acquire entities from each other. If both are on the same system, they will often expect that migration to be easier. Essentially, any frequent migration between customers.
  • Hard to Use - Even an experienced programmer cannot remember a guid. When troubleshooting, it is often frustrating to have to copy and paste identifiers for queries, especially if the support is being done with a remote access tool. It is much easier to constantly refer to SELECT * FROM Xxx WHERE ID = 7 than SELECT * FROM Xxx WHERE ID = 'DF63F4BD-7DC1-4DEB-959B-4D19012A6306'

  • Indexing - using a clustered index for a guid field requires constant rearrangement of the data pages and is not as efficient to index as INTs or even short strings. It can kill performance - don't do it.

CHAR

  • Readability - Although conventional wisdom is that nobody should be in the database, the reality of systems is that people will have access - hopefully personnel from your organization. When those people are not savvy with join syntax, a normalized table with ints or guids is not clear without many other queries. The same normalized table with SOME string keys can be much more usable for troubleshooting. I tend to use this for the type of table where I supply the records at installation time so they do not vary. Things like StatusID on a major table is much more usable for support when the key is 'Closed' or 'Pending' than a digit. Using traditional keys in these areas can turn an easily resolved issue to something that requires developer assistance. Bottlenecks like that are bad even when caused by letting questionable personnel access to the database.
  • Constrain - Even if you use strings, keep them fixed length, which speeds indexing and add a constraint or foreign key to keep garbage out. Sometimes using this string can allow you to remove the lookup table and maintain the selection as a simple Enum in the code - it is still important to constrain the data going into this field.
Jim Blake
Using a GUID as the clustered index is a deadly sin and will kill all your performance - even if you rebuild the index every night! Just don't do it - period.
marc_s
@marc_s Agreed. I edited the answer to state that more forcefully. Thanks.
Jim Blake
+1  A: 

If Joe Celko was on here, he would have some harsh words... ;-)

I want to point out that INTs as a hard and fast rule aren't always appropriate. Say you have a vehicle table with all types of cars trucks, etc. Now say you had a VehicleType table. If you wanted to get all trucks you might do this (with an INT identity seed):

SELECT V.Make, V.Model
FROM Vehicle as V
INNER JOIN VehicleType as VT
ON V.VehicleTypeID = VT.VehicleTypeID
WHERE VT.VehicleTypeName = 'Truck'

Now, with a Varchar PK on VehicleType:

SELECT Make, Model
FROM Vehicle 
WHERE VehicleTypeName = 'Truck'

The code is a little cleaner and you avoid a join. Perhaps the join isn't the end of the world, but if you only have one tool in your toolbox, you're missing some opportunities for performance gains and cleaner schemas.

Just a thought. :-)

Strommy
Your right the 2nd query looks cleaner. Is this really what matters? Couldn't you use ID's' and still keep this clean by using views? Are you sure there are performance gains with the second approach, as other posts have said the opposite. cheers.
HAdes
Yes, it will be more performant CPU-wise although you are going to exchange some data storage and cache memory as a result. The performance gains come from joining integers to each other. It is less costly to do compares on an int than a char. But you don't have to do the join at all in the case I provide. Joining is more expensive than not joining in this case. My point is you shouldn't take an absolutist approach to database design.
Strommy
The performance hit on the joining approach sounds extremely minimal and I'd imagine you wouldn't notice any impact unless your tables were absolutely massive, right? If this is the case, then surely good design with int ID's is better no?
HAdes
Strommy