views:

85

answers:

5

Say I have a database looking like this;

Product with columns [ProductName] [Price] [Misc] [Etc]
Order with columns [OrderID] [ProductName] [Quantity] [Misc] [Etc] 

ProductName is primary key of Product, of some string type and unique.
OrderID is primary key and of some integer type, and ProductName being a foreign key.

Say I change the primary key of Product to a new column of integer type ie [ProductID].

Would this reduce the database size and optimize lookups joining these two tables (and likewise operations), or are these optimizations performed automatically by (most/general/main) SQL database implementations?

Technically, using (String) ProductName as primary key in Product, a database should be able to implement the ProductName column in Order as simply a pointer to a row in Product, and perform a JOIN as quicly as having an integer as a foreign key, is this a standard way of implementing SQL.

Update: This question is about how SQL servers handles foreign keys, not whether a product table needs a serial number, or how I handle to product name change in a database.

+1  A: 

A string primary key is a bad idea, so changing it to an INT will help performance. most databases uses the primary key index for lookups and comparisons, choose a brief primary key—one column, if possible. You use primary key columns for joins (combining data from two or more tables based on common values in join columns), for query retrieval, and for grouping or sorting a query result set. The briefer the index entries are, the faster the database can perform the lookups and comparisons.

Not to mention, if the name of the product changes, how can you handle that? update all rows that contain the product name as a Foreign Key?

I couldn't have said it any better, so check out this answer: Should I design a table with a primary key of varchar or int, quote from that answer:

Using a VARCHAR(10) or (20) just uses up too much space - 10 or 20 bytes instead of 4, and what a lot of folks don't know - the clustering key value will be repeated on every single index entry on every single non-clustered index on the table, so potentially, you're wasting a lot of space (not just on disk - that's cheap - but also in SQL Server's main memory). Also, since it's variable (might be 4, might be 20 chars) it's harder to SQL server to properly maintain a good index structure

KM
+1 for pointing out the problems with changing the name of a product
Donnie
A: 

integer column acts better than string in joins

integer autoinc columns as primary clustered key is good for inserts

Anton Burtsev
A: 

I won't reduce database size (presumably you'll keep the product name field), but should definitely improve lookup performance.

Galwegian
A: 

Integer datatype in most implementations will be less in size than the string (CHAR, VARCHAR etc.), this will make your index smaller in size.

In addition, there are some issues with comparing the strings:

  1. Some databases, namely MySQL, compress the string keys which can make the searches less efficient.

  2. String B-Trees that use natural language identifiers tend to be less concurrency balanced than integer B-Trees. Since the natural language words are not distributed evenly across the alphabet, more updates and inserts will go to the same block, increasing the number of page splits and ultimately increasing the index size. To work around this, Oracle supports REVERSE clause in indexes.

  3. When comparing two strings, a collation should be taken into account. Normally, it does not matter much, however, it does add some overhead.

Quassnoi
B, B+, and B-* Trees are *always* balanced. REVERSE doesn't affect how the B-tree stores data, it just reverses the order of the string in the index so that you can do LIKE searches on the beginning of strings. In SQL Server, you would do this by adding an indexed computed column on a reversed string. And CHAR columns aren't bigger than integers, as long as you're using less than 4 characters.
Dave Markle
@Dave: "Balanced" here means not "height balanced" but "concurrency balanced", i. e. page contention is reduced. Oracle won't use `REVERSE` indexes for the `LIKE` predicates with a leading `%`, it only can use such an index on an equijoin.
Quassnoi
A: 

Primary keys should be unique, exist at time of row creation and be as immutable as possible. IMO, discussions about whether to use a surrogate key should be secondary to issues of data integrity.

If for example a product had a serial number stamped on the item, which had to exist at the time the row in the database was entered and was guaranteed to be unique, then IMO that would make a good primary key. The reason is this value will be used as the foreign key in other tables and it saves you the expense of an additional lookup to get the product's serial number. The additional storage space is inconsequential until you get into the many millions of rows. However, if the serial number was stamped by some other manufacturer so you had no guarantees of uniqueness ("it is probably unique" is not good enough), then a surrogate is appropriate. In fact, I would go so far as to say a good portion if not most "products" tables use surrogate keys because no value that is guaranteed to be available at time of entry, guaranteed to be unique and will be relatively immutable is available as a key.

However, many developers that use surrogate keys overlook the need that every table that has a surrogate key should also have another key (i.e. a unique constraint). Thus, in your case with products, even if you add an integer primary key, you should still have a unique constraint on product name. The unique constraint on product name creates what is called a candidate key with the integer value being the primary key.

Surrogate keys are meant to be behind-the-scenes goo. While integer keys perform the best and are easy to create they have one downside: it is easy, tempting even, for application developers to show the key value to users. This is a mistake IMO. Users should never see the key value or they will come to rely on the value itself which creates problems if you need to re-sequence the values (like say with a database merge) or if you use values that were created in gaps created by the Identity value and they rely on the values being sequential. As long as you never show the value to users, using an integer PK is fine.

Thomas