Duplicating / Unnecessary data in a SQL database.

views:

183

answers:

+3 Q:

Duplicating / Unnecessary data in a SQL database.

When designing a new application/database for multiple users, it usually involves having a table for user management.

The table usually always has Username as well as ID....

Just to check that I am not being an idiot (well, you may still think that!) I have just downloaded and taken a look at the schema of both MediaWiki and PHPBB, and they do the same.

When I first learnt about relational databases, I was always told that a big rule was never to duplicate data or do anything Unnecessary.

So, why is it that we have ID as the primary key instead of Username?

I understand the reason if it was not unique (such as the SO series of sites), however in these applications, it is.

The only thing I could think of is it is quicker to do a Select * from xxx where ID="454" instead of Select * from xxx where name="some_really_long_name" or because having really long names would increase the database size a lot.

Are these the only reasons, or have I missed something here?

+1 A:

At least one reason - you can change a Username inexpensively if it's not the primary key.

martin clayton 2009-11-25 01:19:52

+1, Brilliant, Thank you!

Wil 2009-11-25 01:27:24

+2 A:

This is what is known as a Surrogate Key. It is not duplicating data, it is standing in for data. Mostly it exists simply because it's easier, especially when the natural key is multiple fields, and then you have to use that as a foreign key in another table.

They are common, but there is healthy debate about if they are "correct" or not. Personally, I use them just because it simplifies life. Another bonus is that if you use a natural key and someone typos the key, you end up having to do a cascading update across multiple tables if it's a foreign key. Since it is impossible to typo a surrogate key (it being system generated), this is never an issue.

Donnie 2009-11-25 01:20:27

+1, Brilliant, Thank you! - I do a lot of work without knowing the terminology!

Wil 2009-11-25 01:27:55

It is worth noting that the natural key is almost always still a unique index, and the surrogate key is used as the PK.

Donnie 2009-11-25 01:31:07

"Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them" by Codd. I'd argue that a user id might even be used by the user in some applications, so it's not really a surrogate key. Sometimes there are some special user ids, like for instance -1 in this here community.

Marius Burz 2009-11-25 01:43:18

+1 A:

One key advantage is by having a separate ID you are able to change the username without breaking any of the foreign key relations.

Additionally, it is good RDBMS practice to have primary keys that have no meaning - a username obviously has meaning

DrewM 2009-11-25 01:20:44

+1, Brilliant, Thank you!

Wil 2009-11-25 01:26:21

+8 A:

Reasons for using an integer non-intelligent primary key in place of a unique text key:

Speed of JOINs and other queries.
Ability to change the text key value and preserve the integrity of the database without have to update every table in which the key is referenced.
Increased code efficiency and reduced memory overhead when your application must build a list of key values in memory.
Decreased size of tables which reference the key.

Larry Lustig 2009-11-25 01:21:54

+1, Brilliant, Thank you!

Wil 2009-11-25 01:28:58

Wil 2009-11-27 21:52:12

4 won. Congratulations and thank you.

Wil 2009-11-27 21:56:15

+3 A:

A username can be easly changed/renamed. It's very useful.
It's marginally faster
Overall it saves space/memory(true, tiny amounts)

PS: don't forget to add a unique constraint on the username column.

Marius Burz 2009-11-25 01:25:24

+1, Brilliant, Thank you!

Wil 2009-11-25 01:28:26

+7 A:

Keys serve two disparate purposes, one is to prevent insertion of duplicate rows.... that doesn't mean that the data values are all the same, it means that the two rows represent the same real-world entity. Only a meaningful natural key can accomplish this.
The second purpose is to act as targets of foreign key columns in dependent tables.
For this, the narrowest, (smallest number of bytes) key will generate the best performance on the index that goes with the key and is used when performing searches.

So when the natural key consists of multiple columns, or is very wide, then sometimes it is advisable to create a second alternate, or surrogate key to be used as the target of FK references in other tables. This is generally an internally created value, created within the database, which is not exposed outside the application or system, perhaps not even outside the database component itself.

If that was then the only key, as it is not a meaningful, or natural key, it is totally insufficient to ensure data consistency, as two rows which represent the same entity, and differ in all their attributes only by the meaningless surrogate key, can still be inserted into the table.
Therefore, in such situations, it is a good practice to have both keys on the table.

In addition to often being used to increase performance, surrogate keys have the additional advantedge (Because they are non-meaningful) of not ever having to be changed. Coming up with the best possible natural key, that accurately and uniquely identifies the entity, and will never need to be changed, is an art form and can easily be done poorly. (SSANs are a canonical example) Then, if the real world entity changes any of the values used in a poorly designed natural key, and you are using it as the only key, (and therefore as FKs elsewhere) you will have to change the value everywhere in the database, including in all the other tables where you are using it as a foreign key.

Charles Bretana 2009-11-25 01:28:52

+1, Brilliant, Thank you! You say natural key - is this the same as what Donnie called a Surrogate key?

Wil 2009-11-25 01:32:01

It is the opposite of a surrogate key.

Larry Lustig 2009-11-25 01:32:56

@Larry, Thanks, just re read this - looks like I read a bit to fast the first time... Surrogate key is what I have been using for years - just never knew it was called `surrogate key` and natural key refers to a standard unique column?... I have been doing database work for years, but I "just get on with it" I know what is needed and what to do in places, just don't always know the lingo/definitions/names for certain things.

Wil 2009-11-25 01:40:30

@Charles: A natural key is only a key that exists, as-is or as a commonly held convention. For example - the abbreviation of a US state makes for an ideal key because a) it's unique and b) everyone who knows US states will know the abbreviation. An artificial (AKA surrogate) key does not have any relation to the data - ID #1234 could refer to anything.

OMG Ponies 2009-11-25 03:12:56

+1 - Good explanation. The only other thing that I would add is that an integer *surrogate* key will never need to be changed in the tables that use it as a FK. If you ever need to change the ID then you can do so in the root table without the worry of propagating the change throughout the database. I work with exactly this kind of system and, as much as we discourage people from changing IDs, it simply must be done sometimes.

Mark Brittingham 2009-11-25 03:42:11

@Mark, keerect, thx. Added this point to my answer.

Charles Bretana 2009-11-25 14:57:45

@OMG, Yes, the use of the word "meaningful" hhopefully communocates this point. It is mostly clear, although there are edge cases where the distinction can be grey.. Telephone numbers used to identify a person, or zip codes to identify an address, or SSANs, all fall into the grey area... As a number that can change, they certainly are not permanently and meaningfully "connected" to the entity involved, yet they do "as a pop convention" identify that entity. A good test for a surrogate is something that is generated internally, and not exposed outside the database.

Charles Bretana 2009-11-25 15:04:28

Wil 2009-11-27 21:50:45

Sorry, 4 won! but many thanks.

Wil 2009-11-27 21:54:51

+4 A:

...why is it that we have ID as the primary key instead of Username?

The first rule of data modelling is that you do not expose your keys. This is because:

You want to insulate your system from displaying information to the userbase that can not be altered.
There's a security concern: now your user could be spoofed, assuming you don't take precautions
If you have to migrate to a new database, same vendor or otherwise - the transition should be seamless. You'll likely have to update references, but users will notice if they were 1234 yesterday, and 100234 today. People have enough trouble remembering much less creating a strong password...
Integer based keys are faster, and take less space. These are called artificial keys.
It's a nice feature to allow someone to change their username; when your system deals with peoples names changing because of being married (or divorced) it is a necessity.

Natural vs Artificial Keys

A natural key is only a key that exists, as-is or as a commonly held convention. For example - the abbreviation of a US state makes for an ideal key because:

it's unique
everyone who knows US states will know the abbreviation.

An artificial (AKA surrogate) key does not have any relation to the data - ID #1234 could refer to anything.

In theory, use artificial keys as much as possible. Because ID 1234 could mean anything, when you use natural keys there's less need to have to join to know what a given id value means. In reality, because of both performance and that natural keys aren't very common - use artificial keys.

OMG Ponies 2009-11-25 03:29:17

+1, Many thanks!

Wil 2009-11-27 21:46:18

Wil 2009-11-27 21:53:18

Thank you for the answer but sorry, 4 won.

Wil 2009-11-27 21:55:30

ansaurus

tags:

views:

answers:

Duplicating / Unnecessary data in a SQL database.

Natural vs Artificial Keys

related questions