views:

2707

answers:

21

Is email address a bad candidate for primary when compared to auto incrementing numbers. Our web application needs the email address to be unique in the system. So, I thought of using email address as primary key. But, my colleague suggests that string comparison will be slower to integer comparison. Is it a valid reason to not use email ids as primary key.

We are using postgres Thanks

A: 

you can boost the performance by using integer primary key.

xport
+3  A: 

yes, it is better if you use an integer instead. you can also set your email column as unique constraint.

like this:

CREATE TABLE myTable(
    id integer primary key,
    email text UNIQUE
);
ibram
Why is it "better"? Any reasons or sources?
Sjoerd
experience and sense :)
ibram
Can you elaborate on that?
Sjoerd
+1  A: 

you should use an integer primary key. if you need the email-column to be unique, why don't you simply set an unique-index on that column?

oezi
+108  A: 

String comparison is slower than int comparison. However, this does not matter if you simply retrieve a user from the database using the e-mail address. It does matter if you have complex queries with multiple joins.

If you store information about users in multiple tables, the foreign keys to the users table will be the e-mail address. That means that you store the e-mail address multiple times.

Sjoerd
+1 for mentioning foreign keys; that's the main problem
sleske
+2  A: 

Another reason why integer primary key is better is when you refer to email address in different table. If address itself is a primary key then in another table you have to use it as a key. So you store email addresses multiple time.

klew
+2  A: 

Your colleague is right: Use an autoincrementing integer for your primary key.

You can implement the email-uniqueness either at the application level, or you coudl mark your email address column as unique, and add an index on that column.

Adding the field as unique will cost you string comparision only when inserting into that table, and not when performing joins and foreign key constraint checks.

Of course, you must note that adding any constraints to your application at the database level can cause your app to become inflexible. Always give due consideration before you make any field "unique" or "not null" just because your application needs it to be unique or non-empty.

Here Be Wolves
"Always give due consideration before you implement requirement x just because your application needs requirement x." -- the worst piece of advice I've read in quite some time.
onedaywhen
I'm not convinced by your "argument" -- in real life there will often be situations when some essential data (e.g, a phone number) will not be available immediately. If such a field is marked as NOT NULL in a database, it will require the users to pollute the data with dummy fields (like 123) instead of leaving it empty. It would be more practical to let the application handle the constraints (and in this case, the app could flag an empty field as a action item).
Here Be Wolves
I agree that defining a field "not null" should be done cautiously. Requirements like "we always need the customer's phone number" should be considered carefully. Might it not be desirable at times to create a customer record even though we don't know the phone number right now, and go back and get it later? But "this field must be unique" is a different category. I can't imagine saying "It's okay for two employees to have the same social security number, we'll figure it out later." How would you ever straighten out the data?
Jay
@Here Be Wolves: I knew a woman once who didn't have her own phone number. What do you do then?
David Thornley
+2  A: 

I am not too familiar with postgres. Primary Keys is a big topic. I've seen some excellent questions and answers on this site (stackoverflow.com).

I think you may have better performance by having a numeric primary key and use a UNIQUE INDEX on the email column. Emails tend to vary in length and may not be proper for primary key index.

some reading here and here.

Saif Khan
+58  A: 

I will also point out that email is a bad choice to make a unique field, there are people and even small businesses that share an email address. And like phone number emails can get re-used. [email protected] can easily belong to John Smith one year and Julia Smith two years later.

Another problem with emails is that they change frequently. If you are joining to other tables with that as the key, then you wil have toupdate the other tables as well which can be quite a performance hit when an entire client company changes their emails (which I have seen happen.)

HLGEM
+1 for mentioning the cascading update problem. That's why friends let friends only use surrogate keys ;-).
sleske
@sleke I love the phrase:"That's why friends let friends only use surrogate keys
HLGEM
ah, I don't like the saying at all... surrogate keys can also be the source of problems; yes, the application will be more robust to change of business and/or integrity rules, however the information can get lost a bit easier and the identity of records becomes less clear. so I would _not_ recommend a rule of a thumb here...
Unreason
I have never in 30 years of database work lost records due to a surrogate key.
HLGEM
The OP states very clearly: "Our web application needs the email address to be unique in the system" so your observation about "people and... businesses that share an email address" do not apply in this case.
onedaywhen
How could people share an email address? If email addresses are not unique, the whole email system breaks down. Unless you mean that, say, a husband and wife might share a single email account. But in that case, are they really two different customers? Now it depends on what your definition of a "customer" is. I've used email addresses as primary keys and never had a problem with that. Maybe in some cases it would be an issue. Have to look at the context.
Jay
@onedaywhen and @jay, just because you think it shoud be unique doen't make it unique. And yes a husband and wife might be different customers. Just becasue you haven't run into this before doesn't mean it won't happen. I have run into it and it does happen which is why email should never be allowed to be considered unique whether you think it should be or not. This is the kind of requirement you push back because it is inherently wrong.
HLGEM
@HLGEM: I don't want to get into an endless argument, but you can't say that a proposed key is not unique based on hypotheticals without knowing the context. e.g. from the phone company's point of view, a telephone number uniquely identifies a customer, by definition. Yes, you can say, "But what if there are two or three people who might answer when you call that number?" But this is irrelevant. From the phone company's point of view, by definition this is one customer. (continued ...)
Jay
(continued) Likewise, if you are building a system that is largely concerned with email communications -- perhaps a message dispatching system, or a notification forwarding system -- then it is likely that by definition, an email address uniquely identifies a user. If multiple people share that email address, that is irrelevant. They are a single message destination, therefore, they are a single user. "User" and "customer" do not have to be synonyms for "individual human being".
Jay
More likely, the e-mail address is going to be used as a login ID. That might inconvenience 1% of the customers but the other 99% will have a much happier experience relative to having to choose and remember a unique user name. The only other alternative is OpenID which most laypeople don't have.
Aaronaught
Many websites I frequent require that I supply a unique email address to identify me: Amazon, Google, Microsoft, etc. Please advise how I can "push back" this requirement to them.
onedaywhen
... though I'm not sure I want to: I assume they, as I do, like the idea that an email address is verifiable (they require that I respond to an email they send) and familiar to users (i.e. I can remember my own email address!), both of which are properties of a good key.
onedaywhen
+5  A: 

It is pretty bad. Assume some e-mail provider goes out of business. Users will then want to change their e-mail. If you have used e-mail as primary key, all foreign keys for users will duplicate that e-mail, making it pretty damn hard to change ...

... and I haven't even started talking about performance considerations.

meriton
How would changing email addresses cause there to be duplicates? Unless user A changes his email address, and then user B changes his email to be the same as user A's old value, and your updates are not done in sequence. Remotely possible, I guess.
Jay
A foreign key reference, by definition, contains the value of the primary key of the row it refers to. Put differently, it duplicates the value of the primary key. (So the duplicating is not caused by changing the value. But changing is harder due to this duplication, and the constraint enforcing it).
meriton
+1 for the line "Assume some e-mail provider goes out of business."
Reddy
+42  A: 

the primary key should be unique and constant

email addresses change like the seasons. Useful as a secondary key for lookup, but a poor choice for the primary key.

Steven A. Lowe
A property of a good key is that is should be stable but NOT necessarily immutable.
onedaywhen
@onedaywhen: Yep! Otherwise why would SQL support cascading updates?
Bill Karwin
if you have a choice, go for constant/immutable keys; less work for you down the road; just because SQL supports cascading updates doesn't mean it's always a good idea!
Steven A. Lowe
"change like the seasons"... I just got done with a project where the users' e-mails changed frequently, and without any notice to application managers or developers.
Tim Rourke
+1: relying on cascading updates is bad practice (it brakes db normalization)
Vincent Malgrat
@Vincent Malgrat: "cascading updates... brakes db normalization" -- methinks you have misunderstood the concept of normalization!
onedaywhen
@onedaywhen: in a normalized db, you should not have the same information repeated on multiple rows. An email address as primary key will be repeated on all foreign keys.
Vincent Malgrat
@Vincent Malgrat: thanks for confirming that you have indeed misunderstood the concept of normalization. "you should not have the same information repeated on multiple rows" -- did you really mean to say "information"?! A compound key will usually involve values repeated on multiple rows. For a foreign key, values are referenced rather than "repeated", big difference. A single-column domain with two values (e.g. 'Yes' and 'No') will have the same values on multiple rows in a referencing table if it has three or more rows. This is really basic stuff!
onedaywhen
+2  A: 

Yes, email address is a terrible primary key!

El Ronnoco
A: 

If you have a non int value as primary key then insertions and retrievals will be very slow on large data.

Amareswar
+19  A: 

Disadvantages of using an email address as a primary key:

  1. Slower when doing joins.

  2. Any other record with a posted foreign key now has a larger value, taking up more disk space. (Given the cost of disk space today, this is probably a trivial issue, except to the extent that the record now takes longer to read. See #1.)

  3. An email address could change, which forces all records using this as a foreign key to be updated. As email address don't change all that often, the performance problem is probably minor. The bigger problem is that you have to make sure to provide for it. If you have to write the code, this is more work and introduces the possibility of bugs. If your database engine supports "on update cascade", it's a minor issue.

Advantages of using email address as a primary key:

  1. You may be able to completely eliminate some joins. If all you need from the "master record" is the email address, then with an abstract integer key you would have to do a join to retrieve it. If the key is the email address, then you already have it and the join is unnecessary. Whether this helps you any depends on how often this situation comes up.

  2. When you are doing ad hoc queries, it's easy for a human being to see what master record is being referenced. This can be a big help when trying to track down data problems.

  3. You almost certainly will need an index on the email address anyway, so making it the primary key eliminates one index, thus improving the performance of inserts as they now have only one index to update instead of two.

In my humble opinion, it's not a slam-dunk either way. I tend to prefer to use natural keys when a practical one is available because they're just easier to work with, and the disadvantages tend to not really matter much in most cases.

Jay
+1 for showing the pros AND cons.
Matthew Wood
+1 for noting that mutable referenced keys are a PITA
Conrad Frix
@Conrad: Although, he does point out that it's not a PITA if you have a engine that supports ON UPDATE CASCADE. It's a non-issue at that point code-wise; the only real issue is how extensive is the update and how wide is the key. Email address may be a bit much, but a CASCADE UPDATE for a PK of 2-character country code isn't a big deal.
Matthew Wood
@Matthew IMHO its still a PITA. For example assume that when you designed your country table there were only two tables that referenced it, no biggy, But over time it became 20 tables each with hundreds of thousands of records. Some with the reference some without. This makes a single logic write end up being tens of thousands of writes, and it doesn't make it to all the tables because someone forgot a reference when the added the table. This is exact thing happened to me on a 2 char country code table I kid you not.
Conrad Frix
Jay
Advantage 1 and 3 are premature optimizations, advantage 2 is a very minor benefit and is completely overcome by any decent query tool.
Ash
@Ash: Thee's a difference between "optimizatin" and "premature optimization". But okay, by the same reasoning, all the disadvantages I've seen anyone mention are premature optimizations. So where does that leave you? As to #2, I find typing in extra joins when trying to do ad hoc queries to be a major pain. Records often have multiple foreign keys so you may need several joins to get to comprehensible data. If by "decent query tool" you mean one that figures out what data you want to see without you telling it and magically does the joins for you, I'd like to see how that works.
Jay
A: 

It depends on the table. If the rows in your table represent email addresses, then email is the best ID. If not, then email is not a good ID.

Lajos Arpad
+1  A: 

Use a GUID as a primary key... that way you can generate it from your program when you do an INSERT and you don't need to get a response from the server to find out what the primary key is. It will also be unique accross tables and databases and you don't have to worry about what happens if you truncate the table some day and the auto-increment gets reset to 1.

JoelFan
Unless you care little to nothing about performance then use a GUID. It's no-no #1 if you are building a system that will need to scale
Micah
no... see http://davybrion.com/blog/2009/05/using-the-guidcomb-identifier-strategy/
JoelFan
Said in true Microsoft-Kool-Aid-drinking fashion!
Gary Chambers
A: 

If it's simply a matter of requiring the email to be unique then you can just create a unique index with that column.

Micah
A: 

Personally, I do not use any information for primary key when designing database, because it is very likely that I might need to alter any information later. The sole reason that I provide primary key is, it is convenience to do most SQL operation from client-side, and my choice for that has been always auto-increment integer type.

tia
+2  A: 

I don't know if that might be an issue in your setup, but depending on your RDBMS the values of a columns might be case sensitive. PostgreSQL docs say: „If you declare a column as UNIQUE or PRIMARY KEY, the implicitly generated index is case-sensitive“. In other words, if you accept user input for a search in a table with email as primary key, and the user provides "[email protected]", you won't find “[email protected]".

xl-t
Worth mentioning in this connection that [email protected] and [email protected] may be the same mailbox or may be different mailboxes and you have no way of telling - there's nothing in the spec to say whether the local-part is case-sensitive.
telent
You're right, that depends on the mail server.
xl-t
A: 

Yes, it is a bad primary key because your users will want to update their email addresses.

Lone Coder
+1  A: 

No one seems to have mentioned a possible problem that email addresses could be considered private. If the email address is the primary key, a profile page URL most likely will look something like ..../Users/[email protected]. What if you don't want to expose the user's email address? You'd have to find some other way of identifying the user, possibly by a unique integer value to make URLs like ..../Users/1. Then you'd end up with a unique integer value after all.

Simen Echholt
A: 

Email is a good unique index candidate, but not for primary key, if it is a primary key, you will be no able to change the contact's emails address for example. I think your join querys will be slower too.

Chocolim