views:

398

answers:

4

Hi

I've recently started developing my first serious application which uses a SQL database, and I'm using phpMyAdmin to set up the tables. There are a couple optional "features" I can give various columns, and I'm not entirely sure what they do:

  • Primary Key
  • Index

I know what a PK is for and how to use it, but I guess my question with regards to that is why does one need one - how is it different from merely setting a column to "Unique", other than the fact that you can only have one PK? Is it just to let the programmer know that this value uniquely identifies the record? Or does it have some special properties too?

I have no idea what "Index" does - in fact, the only times I've ever seen it in use are (1) that my primary keys seem to be indexed, and (2) I heard that indexing is somehow related to performance; that you want indexed columns, but not too many. How does one decide which columns to index, and what exactly does it do?

edit: should one index colums one is likely to want to ORDER BY?

Thanks a lot,

Mala

+2  A: 

Primary key is usually used to create a numerical 'id' for your records, and this id column is automatically incremented.

For example, if you have a books table with an id field, where the id is the primary key and is also set to auto_increment (Under 'Extra in phpmyadmin), then when you first add a book to the table, the id for that will become 1'. The next book's id would automatically be '2', and so on. Normally, every table should have at least one primary key to help identifying and finding records easily.

Indexes are used when you need to retrieve certain information from a table regularly. For example, if you have a users table, and you will need to access the email column a lot, then you can add an index on email, and this will cause queries accessing the email to be faster.

However there are also downsides for adding unnecessary indexes, so add this only on the columns that really do need to be accessed more than the others. For example, UPDATE, DELETE and INSERT queries will be a little slower the more indexes you have, as MySQL needs to store extra information for each indexed column. More info can be found at this page.

Edit: Yes, columns that need to be used in ORDER BY a lot should have indexes, as well as those used in WHERE.

Click Upvote
thank you, you have been very helpful! I was wondering though, is setting a column as a PK any different from setting a column as a unique Integer that auto_increments? Does it do anything else under the hood?
Mala
I've edited to answer your question about downsides of indexes. Primary keys also function as indexes, so when you have a PK on a column, all queries for SELECT or ORDER BY based on that id will be faster. Also, it would be ensured that each 'id' is unique, hence you won't have duplicate IDs like you might have if you have just an INT column you are updating.
Click Upvote
A primary key does not need to be numeric, or a single field. and IDENTITY column would fit that description better. Instead a Primary Key serves to tell the RDBMS that the specified fields can uniquely identify a single row in the table. Essentially a UNIQUE INDEX. It is normal (but not required) for Primary Keys to also be the controlling factor in what order data is stored (Clustered Primary Key in SQLServer) and so significantly improve the time take to find rows of data (for read, write, whatever)
Dems
An index creates a searching tree for the information, therefore reducing the the complexity of the search from O(n) to O(log n) for one table. With joined tables the complexity reduction can be even higher.
txwikinger
+4  A: 

The primary key is basically a unique, indexed column that acts as the "official" ID of rows in that table. Most importantly, it is generally used for foreign key relationships, i.e. if another table refers to a row in the first, it will contain a copy of that row's primary key.

Note that it's possible to have a composite primary key, i.e. one that consists of more than one column.

Indexes improve lookup times. They're usually tree-based, so that looking up a certain row via an index takes O(log(n)) time rather than scanning through the full table.

Generally, any column in a large table that is frequently used in WHERE, ORDER BY or (especially) JOIN clauses should have an index. Since the index needs to be updated for evey INSERT, UPDATE or DELETE, it slows down those operations. If you have few writes and lots of reads, then index to your hear's content. If you have both lots of writes and lots of queries that would require indexes on many columns, then you have a big problem.

Michael Borgwardt
+1  A: 

The difference between a primary key and a unique key is best explained through an example.

We have a table of users:

USER_ID number 
NAME varchar(30)
EMAIL varchar(50)

In that table the USER_ID is the primary key. The NAME is not unique - there are a lot of John Smiths and Muhammed Khans in the world. The EMAIL is necessarily unique, otherwise the worldwide email system wouldn't work. So we put a unique constraint on EMAIL.

Why then do we need a separate primary key? Three reasons:

  1. the numeric key is more efficient when used in foreign key relationships as it takes less space
  2. the email can change (for example swapping provider) but the user is still the same; rippling a change of a primary key value throughout a schema is always a nightmare
  3. it is always a bad idea to use sensitive or private information as a foreign key
APC
+1  A: 

In the relational model, any column or set of columns that is guaranteed to be both present and unique in the table can be called a candidate key to the table. "Present" means "NOT NULL". It's common practice in database design to designate one of the candidate keys as the primary key, and to use references to the primary key to refer to the entire row, or to the subject matter item that the row describes.

In SQL, a PRIMARY KEY constraint amounts to a NOT NULL constraint for each primary key column, and a UNIQUE constraint for all the primary key columns taken together. In practice many primary keys turn out to be single columns.

For most DBMS products, a PRIMARY KEY constraint will also result in an index being built on the primary key columns automatically. This speeds up the systems checking activity when new entries are made for the primary key, to make sure the new value doesn't duplicate an existing value. It also speeds up lookups based on the primary key value and joins between the primary key and a foreign key that references it. How much speed up occurs depends on how the query optimizer works.

Originally, relational database designers looked for natural keys in the data as given. In recent years, the tendency has been to always create a column called ID, an integer as the first column and the primary key of every table. The autogenerate feature of the DBMS is used to ensure that this key will be unique. This tendency is documented in the "Oslo design standards". It isn't necessarily relational design, but it serves some immediate needs of the people who follow it. I do not recommend this practice, but I recognize that it is the prevalent practice.

An index is a data structure that allows for rapid access to a few rows in a table, based on a description of the columns of the table that are indexed. The index consists of copies of certain table columns, called index keys, interspersed with pointers to the table rows. The pointers are generally hidden from the DBMS users. Indexes work in tandem with the query optimizer. The user specifies in SQL what data is being sought, and the optimizer comes up with index strategies and other strategies for translating what is being sought into a stategy for finding it. There is some kind of organizing principle, such as sorting or hashing, that enables an index to be used for fast lookups, and certain other uses. This is all internal to the DBMS, once the database builder has created the index or declared the primary key.

Indexes can be built that have nothing to do with the primary key. A primary key can exist without an index, although this is generally a very bad idea.

Walter Mitty