views:

125

answers:

5

This question is geared towards MySQL, since that is what I'm using -- but I think that it's probably the same or similar for almost every major database implementation.

How do keys work in a database? By that I mean, when you set a field to 'primary key', 'unique key' or an 'index' -- what do each of these do, and when should I use each one?

Right now I have a table containing a few fields, one of them being a GUID (minus the { and } around it). I set the GUID field to the primary key and I see that it created a binary tree. So it improves search performance -- but what differentiates that from other types of keys?

I realize this may not really be programming related (although it is development related) -- I wasn't sure where exactly to ask this but SO is what I use the most so I'll ask here. Migrate as necessary

+8  A: 

There are probably hundreds of references for this elsewhere on the web, so a bit of Googling will help you get deep into understanding DB design. That said, the basic gist is:

  • primary key: a field or combination of fields which must be unique for each row, and which is/are indexed to provide rapid lookup of a row given a key value; cannot contain NULL, and a table can only have one primary key. Generally indexed in a clustered index, which means that the data in the table is reordered to match the order of the index, a process that greatly improves serial data retrieval. (This is the main reason a table can only have one primary key -- the order of the data can't match the order of more than one index!)
  • unique key: same as a primary key, but on some DB platforms, can contain NULL values so long as they don't violate the uniqueness constraint. (In other words, if the unique key contains a single column, there can only be one row in the table with NULL in that column; if the key contains more than one column, then the table can only contain rows with NULLs in the columns such that there's no non-unique duplication of NULL values across the columns in the key.) On other platforms (including MySQL), unique constraints can contain multiple NULLs; the uniqueness constraint only applies to non-NULL values of the referenced columns. There can be more than one of these per table. Indexed in a non-clustered index.
  • index: a field or combination of fields which are pre-indexed for more rapid retrieval given a value for the field(s) in the index. A table can have more than one index.
delfuego
thank you, a basic understanding was all I was really looking for, I just wanted to make sure I was making the right decision (and can keep making the right decision) when I add a key to a table
Carson Myers
Good answer. FWIW, some RDBMS permit multiple NULLs in column(s) that have a unique constraint. And you didn't cover foreign keys.
Bill Karwin
Bill, since I just noticed that the OP was asking specifically about MySQL (which is one of the RDBMSes that allows multiple NULLs with UNIQUE constraints), I'll modify my answer... you're right. But the OP didn't ask about foreign keys...
delfuego
I'm still fairly new to SQL, delfuego answered what I was wondering -- I haven't needed to build a database complex enough to warrant foreign keys (assuming foreign keys are used with more complicated data relationships). I'm probably still a ways from needing structures like that -- I mean, I've written all of one JOIN query in my life, so far everything I've done has been simple.
Carson Myers
A: 

At the most basic, primary keys represent how the records will be physically stored in memory / on disk, you would want the unique field you're going to search on the most to be this as it will greatly reduce searching.

Unique key's are fields that can only contain unique values.

An index is a specialized "map" to the database file that queries can reference.

These are extremely simplified answers, but I think that's the gist of it.

mletterle
+1  A: 

When you define a primary key, the database creates an index based on that key. It needs to be unique. In general you can create an index that to speed up access to data based on non-unique query data. The indexed retrieval time for a uniquely keyed data should be better than for non-uniquely keyed indexes, so I try to use unique indexes where possible.

nont
A: 

One more thing, any key is essentially a separate table that is sorted by the index that points directly to the row(s) that match the key. A BTree style index is stored in a balanced tree, a balanced tree is a tree structure where traveling left is smaller and traveling right is larger.

    5
 3    7
2 4  6 8

Would be an example of a balanced tree. The other major type is a Hash, where a mathematical expression turns the key into the relative memory location of the key.

MindStalker
A: 

In order to really understand keys, you have to understand them at three levels: conceptual, logical, and physical. I'm going to reverse my habitual order, and discuss physical first.

Most programmers tend to think at the physical level. At the physical level, a key is a surrogate (stand-in) for the address of a row. When a row is to be referenced, a copy of the key can be used to specify the row. When a reference to a row is made in another row, the copy is known as a foreign key.

Most experienced programmers have a thorough understanding of pointers and addresses, and would understand exactly how the data structure worked if only it used pointers and addresses. Before the relational databases became dominant, there were in fact databases that used pointers to records embedded in other records to tie the data together.

A disadvantage to using keys instead of pointers is that the DBMS has to use an index to translate a key reference back to a pointer in order to retrieve the row in question. An advantage is that the level of indirection allows the DBMS to shuffle all the rows in a table for whatever purpose, as long as the DBMS updates all the relevant indexes accordingly.

Viewed at this level, keys might as well be simple, integer, and autoincremented. These work faster than other kinds of keys, and they sidestep certain data management issues that arise when user supplied data is missing or inconsistent. However, sidestepping data management issues at this level can create a minefield at the two higher levels.

At the logical level, a key is a minimal subset of the data in a tuple (row) that allows a single matching tuple to be specified, and when the DBMS retrieves the container for that tuple, all the attributes in the tuple are now available. Every relation has at least one candidate key. In the worst case, the entire tuple is the only candidate key. When multiple candidate keys exist for a single relation (table), common practice is to choose one candidate key as the primary key, and to make all references via this primary key.

(Actually, relation and table are not synonymous, but I'm simplifying here. Likewise, tuple and row are not synonymous, although they look identical at first glance.)

The primary reason to declare a primary key is to rule out duplicate keys or missing keys. Sometimes database people choose to leave duplicate and missing key avoidance up to the programmers whose applications write to the database. More commonly, a primary key constraint serves to reflect an error back to a program that violates a primary key constraint.

When a DBMS sets up a primary key constraint, it also builds an index on the primary key. This allows the DBMS to find duplicates quickly, and it also speeds up certain queries that use the key column(s).

At the conceptual level, keys are the means by which the user community identifies instances of entities, whether those entities are persons (employees, travellers, etc.), things (bank accounts, hotel rooms, etc.) or whatever. The key is data and the entity identified by the key is not data. The key can thus be seen a surrogate for the entity in the database.

At the conceptual level, keys are always natural, and never automatically supplied by the system. However, in the real world, keys are often mismanaged, and the consequences of mismanagement are overcome by what is called "common sense". Instilling common sense into an automated system is generally not feasible.

I never really described an index in the above, but it's implicit in what I said. An index is a data structure that serves to map from a key to a pointer. In all the databases you are likely to use, indexes are declared by the database builder (or perhaps a DBA) and managed by the DBMS.

Walter Mitty