I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use Strings for Primary Keys than integers?

+9  A: 

Technically yes, but if a string makes sense as the primary key then you should probably use it. It all depends on the size of the table and the length of the string that will be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but on smaller tables the slowdown from using a string will be minuscule compared to the headaches you can get from an integer that doesn't mean anything in relation to the data.

Kevin
Wouldn't it depend on the database? I would think a properly indexed string would not be much slower, if at all, than a number.
Ryan Guill
I would agree there are a lot of variables to consider. (In SQL Server) we have seen real performance issues with strings whose lengths are in the mid to high teens and above, even when indexed. But you're right, there are ways to overcome this, hardware for instance.
Kevin
Fair enough. I would agree though that if a string makes sense, that is what you should use. I would also say there are definitely times for GUID or UUID fields in databases where an autoincrement field would not work.
Ryan Guill
oh absolutely! I totally agree. There's definitely a use for everything. It's just figuring out which tool to use when :).
Kevin
Also keep in mind that there's often a very big difference between a CHAR and a VARCHAR when doing index comparisons
Tom H.
+1  A: 

Indices imply lots of comparisons.

Strings are harder to compare than integers, for obvious reasons.

Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string to numerical id table.

Quassnoi
A: 

Probably. But it sounds like a case of premature optimization. Surely your body of data or your domain isn't so complex that you need to know this so soon.

Mark Canlas
Since this is critical to the overall performance of the database forever, it is entirely correct to consider it in design. Make a bad choice here and you cannot easily fix it later without rewriting all your SQL code. Structural things that can't be changed easily once the rest of the house is built on top of them must be done right from the start.
HLGEM
+1  A: 

Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...

Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.

Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: Guids.

Joel Coehoorn
Strings for consistency among many databases
Grasper
that makes no sense cloutierm, what do you mean?
HLGEM
@HLGEM: If I understand him right, he means something like syncing records created on a laptop with the main db.
Joel Coehoorn
I mean I have two separate databases with the same entities, only one is updated less frequently for persistent storage purposes. If I Query for entity "California" on Database A, I want it to be fundamentally the same "California" on Database B.
Grasper
that's replication.
Joel Coehoorn
And it is 'like' syncing records created in a laptop in that it's the same problem: records created in one place should not conflict with records created in another. One possible solution here is Guid keys.
Joel Coehoorn
+3  A: 

Another issue with using strings as a primary key is that the index is constantly kept in sorted order, so when a new key is created that falls in the middle of that order, the index has to be resequenced. If you use an auto-number integer, the new key is just added to the end of the index.
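As a rough illustration of that point, here is a minimal sketch that uses a plain sorted Python list as a stand-in for an ordered index. This is an assumption made for illustration only: real engines use B-trees and pages, so the actual cost is page splits rather than shifting every entry. The function name `count_mid_inserts` and the sample keys are hypothetical.

```python
import bisect
import random
import string

def count_mid_inserts(keys):
    """Insert keys into a sorted list and count inserts that do not land at the end."""
    index = []            # toy stand-in for an ordered index
    mid_inserts = 0
    for key in keys:
        pos = bisect.bisect_right(index, key)
        if pos < len(index):
            mid_inserts += 1   # existing entries after pos must make room
        index.insert(pos, key)
    return mid_inserts

rng = random.Random(0)
int_keys = list(range(1, 1001))   # auto-increment style keys
str_keys = ["".join(rng.choices(string.ascii_lowercase, k=8)) for _ in range(1000)]

sequential_mid = count_mid_inserts(int_keys)   # 0: each new key is the largest so far
random_mid = count_mid_inserts(str_keys)       # most random strings land mid-index
print(sequential_mid, random_mid)
```

The sequential keys always append, while the random string keys mostly land somewhere in the middle, which is the behaviour the answer describes for a clustered index.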

Jeff Martin
good point.. did not think of that.
Grasper
This can cause "hot spots" for new inserts though. As long as you're managing your database properly, you should have extra space on your pages for inserts anyway and page splits should be rare.
Tom H.
that is when primary keys are clustered. you can create them unclustered too.
Learning
Good point Learning
Tom H.
A: 

Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.

One thing to watch out for, though, is if you have a clustered index on any key and are doing large numbers of inserts that are non-sequential in the index. Every row written will cause the index to be rewritten. If you're doing batch inserts, this can really slow the process down.

Jekke
A: 

What is your reason for having a string as a primary key?

I would just set the primary key to an auto incrementing integer field, and put an index on the string field.

That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.

You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.
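A minimal sketch of that layout, using Python's built-in `sqlite3` module purely for illustration. The table name `state`, the index name `idx_state_name`, and the choice to make the index UNIQUE are assumptions, not something the answer specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE state (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key used by joins
        name TEXT NOT NULL                       -- the string you search on
    );
    -- Assumption: the string is unique; a plain index works the same way for lookups.
    CREATE UNIQUE INDEX idx_state_name ON state(name);
""")
conn.executemany("INSERT INTO state (name) VALUES (?)",
                 [("California",), ("Oregon",), ("Nevada",)])

# Look up the integer key by string; the index makes this a search, not a scan.
row = conn.execute("SELECT id FROM state WHERE name = ?", ("Oregon",)).fetchone()

# Ask SQLite how it plans to run the lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM state WHERE name = ?", ("Oregon",)
).fetchall()
plan_text = " ".join(str(r[-1]) for r in plan)
print(row[0], plan_text)
```

Indexing only the first few characters is engine-specific (MySQL, for example, accepts a prefix length in the index definition); SQLite has no direct equivalent, so it is not shown here.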

SkippyFire
I think putting any intelligence in a key is asking for trouble. Will the keys stay unique? Did they start all the account numbers with the state's abbreviation at the beginning, only to have the client move? Updating a field is no problem, but all those tables linked by account number: what a mess.
Jeff O
+1  A: 

It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design use the int unless you plan on replicating data, then use a GUID.
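To make the replication case concrete, here is a small sketch assuming two independent SQLite databases (named `laptop` and `main` only for illustration) that each generate GUID keys with Python's `uuid.uuid4()`. The `entity` table is hypothetical.

```python
import sqlite3
import uuid

def make_db():
    """Create an in-memory database with a GUID-keyed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entity (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
    return conn

laptop, main = make_db(), make_db()

# Each site generates its own keys with no coordination between them.
laptop.execute("INSERT INTO entity VALUES (?, ?)", (str(uuid.uuid4()), "California"))
main.execute("INSERT INTO entity VALUES (?, ?)", (str(uuid.uuid4()), "Oregon"))

# Syncing is a plain copy; two auto-increment tables would both have used id 1.
for row in laptop.execute("SELECT id, name FROM entity").fetchall():
    main.execute("INSERT INTO entity VALUES (?, ?)", row)

merged = main.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
print(merged)
```

With integer auto-increment keys the copy step would hit a primary key conflict, because both databases would have handed out the same first id.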

If this is an Access database or some tiny app, then who really cares. I think the reason most of us developers slap the old int or GUID at the front is that projects have a way of growing on us, and you want to leave yourself the option to grow.

Al Katawazi
+3  A: 

Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.
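The rename problem can be sketched with Python's built-in `sqlite3` module. All table and column names below (`company_nat`, `order_nat`, `company_sur`, `order_sur`) are hypothetical, and the natural-key side assumes the foreign key was declared with `ON UPDATE CASCADE`; without that, the related rows would have to be fixed by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    -- Natural key: the name itself is copied into every referencing row.
    CREATE TABLE company_nat (name TEXT PRIMARY KEY);
    CREATE TABLE order_nat (
        id INTEGER PRIMARY KEY,
        company_name TEXT NOT NULL
            REFERENCES company_nat(name) ON UPDATE CASCADE
    );
    -- Surrogate key: referencing rows hold an integer that never changes.
    CREATE TABLE company_sur (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE order_sur (
        id INTEGER PRIMARY KEY,
        company_id INTEGER NOT NULL REFERENCES company_sur(id)
    );
""")

conn.execute("INSERT INTO company_nat VALUES ('Acme')")
conn.execute("INSERT INTO company_sur VALUES (1, 'Acme')")
for i in range(1, 4):
    conn.execute("INSERT INTO order_nat VALUES (?, 'Acme')", (i,))
    conn.execute("INSERT INTO order_sur VALUES (?, 1)", (i,))

# The company is renamed.
conn.execute("UPDATE company_nat SET name = 'Acme Corp' WHERE name = 'Acme'")
conn.execute("UPDATE company_sur SET name = 'Acme Corp' WHERE id = 1")

# Natural key: every order row had to be rewritten by the cascade.
nat_rewritten = conn.execute(
    "SELECT COUNT(*) FROM order_nat WHERE company_name = 'Acme Corp'").fetchone()[0]
# Surrogate key: the order rows still point at id 1 and were never touched.
sur_untouched = conn.execute(
    "SELECT COUNT(*) FROM order_sur WHERE company_id = 1").fetchone()[0]
print(nat_rewritten, sur_untouched)
```

With three orders the difference is trivial, but the cascade grows with the number of related rows, and any table that stores the name without a declared foreign key is silently left with the old value.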

HLGEM
Strings that are good candidates for PKs do not have duplicates - otherwise they wouldn't be a good candidate for a PK. Think of ICD-9 codes, country codes, VIN #s. Using a name as an example of a problem with natural keys is misguided, because they should never be a candidate in the first place.
Tom H.
@Tom H: ISO country codes DO change. [ http://en.wikipedia.org/wiki/ISO_3166-1#Editions_and_changes ] As an answer on a related question said [ http://stackoverflow.com/questions/925266/database-design-and-the-use-of-non-numeric-primary-keys/925290#925290 ] "For PRIMARY KEY's make sure their uniqueness is under your control"
Steve Schnepp
A: 

There may be a very big misunderstanding about strings in databases. Almost everyone assumes that the database representation of numbers is more compact than that of strings. They think that numbers in databases are represented the same way as in memory. But that is not true. In most cases the number representation is closer to a string-like representation than to anything else.

The speed of using a number or a string depends more on the indexing than on the type itself.

takacsot
A: 

From a performance standpoint: yes, a string primary key (PK) will be slower than an integer PK.

From a requirements standpoint: although this is not part of your question, I would still like to mention it. When we handle huge amounts of data across different tables, we generally look for the probable set of keys that can be used for a particular table. This is primarily because there are many tables, and most of them are related to others through some relation (the concept of a foreign key). Therefore we cannot always choose an integer as the primary key; instead we go for a combination of 3, 4 or 5 attributes as the primary key for that table. Those keys can then be used as a foreign key when we relate the records to some other table. This makes it useful to relate records across different tables when required.

Therefore, for optimal usage, we always make a combination of 1 or 2 integers with 1 or 2 string attributes, but again only if it is required.

A: 

Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.

In this case, it's almost always better to go with a string as a natural primary key, provided you can trust it. Don't worry that it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.

Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is it sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?

Are the programmers and interactive users who specify queries able to use the natural key to get what they want?

If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whether to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.

Walter Mitty