views:

33

answers:

2

I have a database where several tables have a column email for storing email address. Because this is for surveys, a lot of values will be identical, more likely identical over names, addresses, etc.

Should I just have a master Emails table, and then an email_id column? That way I'm only storing email strings once, instead of multiple times in tables. But if I wanted to be sure that I was only storing unique emails, isn't there some limitation to the length that the index will check a string for uniqueness, therefore I could be storing multiple copies of long email addresses?

In the survey database, we store the email address they submit. If they choose to join a mailing list, we store those uniquely (one email per membership) in a mailing list membership table, so there could be multiple of the same addresses in that table, depending on how many clubs they join. Now I'm adding a table to track bounceback emails, because that's a property of the email address, not the survey or mailing list membership. And I'm thinking, "This is a lot of string joins!"

Is this a form of the "One True Lookup Table"?

+3  A: 

Should I just have a master Emails table, and then an email_id column?

It doesn't actually matter very much.

isn't there some limitation to the length that the index will check a string for uniqueness, therefore I could be storing multiple copies of long email addresses?

No. There's no limitation. Unique means unique, not "unique to some random limit".

And I'm thinking, "This is a lot of string joins!"

So? String joins aren't horribly slow. If you can prove that these string joins are the worst bottleneck in your application, then replacing string joins with integer FK's might speed things up.

Until you can prove that these string joins are your worst problem, don't worry about them.

Worry about getting the business rules right for how an email address is used. Don't optimize until you can prove you have a problem.

S.Lott
I hear what you're saying, but by the time these bottlenecks become a problem, the whole running live app is built around them, and it's very difficult to convince a boss to let me take a lot of time to carefully deconstruct everything on the live site with the risk of downtime, data loss, etc. just to make things right if/when it becomes an issue. I'm already not liking my join on the bouncebacks; there will be future tables!The fact is that I'm developing *now*, I can do it right *now*, and if I do do it correctly, now, I won't have to open the patient up later to retrieve the forceps.
@user151841: There are numerous things you can waste time on right *now*. You can waste time micro-optimizing the `++` operator. You can waste time micro-optimizing loops that end with `<` or `<=`. Things will not abruptly stop running when string comparisons on the bottleneck. You'll notice degradation, do profiling, discover the problem. ALL performance optimizations involve rewrites. NO performance "optimization" should be done initially. Get the data model **right**. That's what matters most.
S.Lott
A: 

If the issue is simlpy "members have email addresses", then I'd keep the email address directly associated with the member, and not normalize it out to an Emails table. This is because not all "members" will necessarily share emails.

  • If (and don't ask me why, I don't understand end-users) two memebers share the same email address, what happens when one of them changes their address--but the other one doesn't want to?

  • Second case, what if I have two memberships in your system, both with the same email address, and then I want to change one of them to a different address? (Don't ask me why, I'm an end-user and I already said I don't understand end-users.)

This would cover a fairly simple and straight-forward situation. If your system is different such that you need more or tighter control over emails, normalization may work for you. The trick is in determining if, from the perspective of the data, it's repeating data that can be normalized, or "distinct" data that just happens to contain some duplicating values.

The Bounceback email table fits in either way, as its a separate kind of data (or kind of email address).

As for strings and index length, these days if an RDBMS claims that it can index or uniquely index a string up to X characters (just how long do email addresses get?) you can rely on it to do so. It may not perform too quickly, as it has to process X bytes of data per key instaed of 4 (typical integer storage size), but it will work.

Philip Kelley
I knew there was something I was overlooking... we can't change email addresses :( One of the requirements is that one email address can sign up kids or spouse :P
Well, we'd have to do it by name + email address.