views:

2490

answers:

9

Is there a measurable performance difference between using INT vs. VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US States, Country Codes) and a coworker won't budge on the INT AUTO_INCREMENT as a primary key for all tables.

My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible, since every INT foreign key reference will require a JOIN to make sense of the reference, a VARCHAR key will directly present the information.

So, does anyone have experience with this particular use-case and the performance concerns associated with it?

+1  A: 

This article is about Oracle, but it probably applies.

Paul Tomblin
Page was unavailable when I tried to visit. Care to summarize for us?
Joel Coehoorn
@Joel, It basically says "use what makes more sense to you from an architecture point of view, and don't worry about performance"
Paul Tomblin
"In all my years of working with databases, I have never once worried about whether an index on VARCHAR2 or NUMBER columns performs better. "
Eli
+2  A: 

For short codes, there's probably no difference. This is especially true as the table holding these codes are likely to be very small (a couple thousand rows at most) and not change often (when is the last time we added a new US State).

For larger tables with a wider variation among the key, this can be dangerous. Think about using e-mail address/user name from a User table, for example. What happens when you have a few million users and some of those users have long names or e-mail addresses. Now any time you need to join this table using that key it becomes much more expensive.

Joel Coehoorn
Do you know for sure that this would be expensive? Or are you just guessing?
Steve McLeod
Of course it depends on the rdbms implementation, but from what I understand most servers will keep of hash of the actual value for indexing purposes. Even so, and even if it's a relatively short hash (say, 10 byte), it's still more work to compare 2 10 byte hashes than 2 4 byte ints.
Joel Coehoorn
+15  A: 

It's not about performance. It's about what makes a good primary key. Unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key. But bitter experience is that is seldom so.

INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.

Steve McLeod
True. One of my biggest databases has entries for Yugoslavia and the Soviet Union. I'm glad they're not primary keys.
Paul Tomblin
Couldn't you cascade changes?
jakemcgraw
Yes you can. But that's not how primary keys are supposed to work.
Steve McLeod
@Steve PKs are immutable?
jakemcgraw
@Steve, then why does ANSI SQL support syntax for ON UPDATE CASCADE?
Bill Karwin
What I mean is that primary keys are supposed to be considered immutable. "Unique and unchanging over time". Of course, you could change them if you want. You could cascade changes too. But then you are breaking the guidelines of good database design.
Steve McLeod
+2  A: 

Absolutely not.

I have done several... several... performance checks between INT, VARCHAR, and CHAR.

10 million record table with a PRIMARY KEY (unique and clustered) had the exact same speed and performance (and subtree cost) no matter which of the three I used.

That being said... use whatever is best for your application. Don't worry about the performance.

Timothy Khouri
meaningless without knowing how long the varchars were... If they were 100 bytes widem then guaranteed you're not getting the same performance as a 4 byte int
Charles Bretana
+2  A: 

As for Primary Key, whatever physically makes a row unique should be determined as the primary key.

For a reference as a foreign key, using an auto incrementing integer as a surrogate is a nice idea for two main reasons.
- First, there's less overhead incurred in the join usually.
- Second, if you need to update the table that contains the unique varchar then the update has to cascade down to all the child tables and update all of them as well as the indexes, whereas with the int surrogate, it only has to update the master table and it's indexes.

The drawaback to using the surrogate is that you could possibly allow changing of the meaning of the surrogate:

ex.
id value
1 A
2 B
3 C

Update 3 to D
id value
1 A
2 B
3 D

Update 2 to C
id value
1 A
2 C
3 D

Update 3 to B
id value
1 A
2 C
3 B

It all depends on what you really need to worry about in your structure and what means most.

Jason Lepack
A: 

Not sure about the performance implications, but it seems a possible compromise, at least during development, would be to include both the auto-incremented, integer "surrogate" key, as well as your intended, unique, "natural" key. This would give you the opportunity to evaluate performance, as well as other possible issues, including the changeability of natural keys.

George Jempty
+5  A: 

You make a good point that you can avoid some number of joined queries by using what's called a natural key instead of a surrogate key. Only you can assess if the benefit of this is significant in your application.

That is, you can measure the queries in your application that are the most important to be speedy, because they work with large volumes of data or they are executed very frequently. If these queries benefit from eliminating a join, and do not suffer by using a varchar primary key, then do it.

Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.

Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.

Bill Karwin
And sometimes, (imho, often), both is better, the surrogate to use for FK references in other tables, and for Joins, and the natural key to ensure data consistency
Charles Bretana
+2  A: 

Depends on the length.. If the varchar will be 20 characters, and the int is 4, then if you use an int, your index will have FIVE times as many nodes per page of index space omn disk... That means that traversing the index will require one fifth as many physical and/or logical reads..

So, if performance is an issue, given the opportunity, always use an integral non-meaningful key (called a surrogate) for your tables, and for Foreign Keys that reference the rows in these tables...

At the same time, to guarantee data consistency, every table where it matters should also have a meaningful non-numeric alternate key, (or unique Index) to ensure that duplicate rows (based on meanignful table attributes) cannot be inserted.

For the specific use you are talking about (like state lookups ) it really doesn't matter because the size of the table is so small.. In general here is no impact on performance from indices on tables with less than a few thousand rows...

Charles Bretana
A: 

I like to use 38 characters long GUIDs as primary keys. It's totally unique from all points of view

abatishchev