Is normalizing a person's name going too far?

views:

1258

answers:

+8 Q:

Is normalizing a person's name going too far?

You usually normalize a database to avoid data redundancy. It's easy to see in a table full of names that there is plenty of redundancy. If your goal is to create a catalog of the names of every person on the planet (good luck), I can see how normalizing names could be beneficial. But in the context of the average business database is it overkill?

(Of course I know you could take anything to an extreme... say if you normalized down to syllables... or even adjacent character pairs. I can't see a benefit in going that far)

Update:

One possible justification for this is a random name generator. That's all I could come up with off the top of my head.

+1 A:

I would say yes, it is going too far in 95%+ of the cases.

Nebakanezer 2009-04-23 16:38:10

Okay, I'll bite; what are the other 5% of cases? :-)

McWafflestix 2009-04-23 16:41:09

I'm biting too, don't you mean 5%- ;-)

Patrick McDonald 2009-04-23 16:44:33

Half of the Koreans are Kim's, Lee's or Park's, and there are only about 250 different surnames (they could probably even fit only one byte!)

Quassnoi 2009-04-23 16:44:39

Good point. But fortunately, modern DBMS recognize such extreme fields and adapt their storage and indexing behavior to such data patterns, so it is handled on a lower level than the data model. So formally, every Korean can have a distict first name, even if most don't, and still the db handles them as efficient as a bot army with numbered, unique first names. (Or even more efficient.)

TheBlastOne 2010-07-19 16:34:28

TheBlastOne: Can you please name a DBMS that can recognize that there are only 250 different surnames in my table and optimize them to only require 1 byte of storage? I'm not familiar with one.

Gabe 2010-07-19 16:42:02

+2 A:

Yes, definitely overkill. What's a few dozen bytes betewen friends?

Alnitak 2009-04-23 16:38:21

No-one should ever need more than 640K

Colin Pickard 2009-04-30 13:44:52

+1 A:

Generally yes. Normalizing to that level would be going to far. Depending on the queries (such as phone books where searches by last name are common) it might be worthwhile. I expect that to be rare.

Adam Davis 2009-04-23 16:38:33

And even in the phone book scenario a good index would be all that's needed, but there are still very narrow uses where normalizing the name makes sense.

Adam Davis 2009-04-23 16:39:41

+1 A:

Yes. I cannot think of an instance where the benefits outweigh the problems and query complications.

billb 2009-04-23 16:38:50

+42 A:

Yes, it's an overkill.

People don't change their names from Bill to Joe all at once.

Quassnoi 2009-04-23 16:39:18

This really illustrates the point. The reason for normalization is so that if you have to change something that's the same all over the database that you don't have to change a whole bunch of records. But you would never have to change the name of everybody names Bill to Joe.

Kibbee 2009-04-23 16:48:00

Nice job, great direct point!

curtisk 2009-04-23 17:07:22

I'm not disputing the claim that it's overkill for most cases but I think your reasoning against it is flawed. Just playing devil's advocate here... Say you have a FirstName table and a FullName table. if a single person changed their name from Bill to Joe wouldn't you update the foreign key in the FullName table and not the record in the FirstName table?

codeelegance 2009-04-23 17:08:56

@Kenneth: Denormalization is just for the cases I described: it simplifies mass updates. If a company changes its name, you just change one field in the joined table and the SQL query the badge printing machine issues returns new results. Joins are bad for performance and complexity, the only thing they are good for is mass updates.

Quassnoi 2009-04-23 17:20:08

REPLACE('Denormalization', 'Normalization') in the comment above

Quassnoi 2009-04-23 17:22:42

I think there's more to normalization than mass updates. You look at nearly any contact database and you'll see City and State tables but I can't remember the last time a city or state changed their name.

codeelegance 2009-04-23 17:39:53

@Kenneth: name is not the only property of a city. There is area code, population, and thousand more properties, which can be changed all right. As for a name as an entity, it does not have other properties but its spelling.

Quassnoi 2009-04-23 18:16:06

@Kenneth: But you are right, normalization is good for one more thing: indexability. Creating an index on a surrogate INTEGER primary key is much easier than on a random property string.

Quassnoi 2009-04-23 18:17:09

@Kenneth: By the way, my grandgrandmother from this post: http://stackoverflow.com/questions/768123/how-would-you-live-without-mysql/768135#768135 was born in Saint-Petersburg, married in Petrograd, defended Leningrad during the blockade, and died in Saint-Petersburg. And she never moved out of the city.

Quassnoi 2009-04-23 18:28:08

I generally haven't seen a need to normalize the name, mainly because that adds a performance hit on the join that will always be called, and doesn't give any benefit.

If you have so many similar names, and have a storage problem then it may be worth it, but there will be a performance hit that would need to be considered.

James Black 2009-04-23 16:39:42

I would say it is absolutely overkill. In most applications, you display folks' names so often, every query involved with that is going to look that much more complex and harder to read.

Dana 2009-04-23 16:39:59

Yes, it is. It is commonly recognized that just applying all of the Rules of Normalization can cause you to go way too far and end up with an overnormalized database. For example, it would be possible to normalize every instance of every character to a reference to a character enumeration table. It's easy to see that that's ridiculous.

Normalization needs to be performed at a level that is appropriate for your problem domain. Overnormalization is as much a problem as undernormalization (although, of course, for different reasons).

McWafflestix 2009-04-23 16:40:31

+23 A:

Database normalization usually refers to normalizing the field, not its content. In other words, you would normalize that there only be one first name field in the database. That is generally worthwhile. However the data content should not be normalized, since it is individual to that person - you are not picking from a list, and you are not changing a list in one place to affect everybody - that would be a bug, not a feature.

Yishai 2009-04-23 16:40:46

+1 A:

No, but you might want to normalise to a canonical record for a customer (so you don't get 5 different entries for 'Bloggs & Co.' in your database. This is a data cleansing issue that often bites on MIS projects.

ConcernedOfTunbridgeWells 2009-04-23 16:42:04

+2 A:

Maybe if you work in the Census office it might make sense. Otherwise, see every other answer :)

Patrick McDonald 2009-04-23 16:42:40

+5 A:

How do you normalize a name? Not all names have the same structure. Not all countries or cultures use the same rules for names. A first name is not necessarily just a first name. People have variable numbers of names. Some countries don't have the simple pair of firstname/lastname. What if my first name just so happens to be your last name, should they be considered the same in your database? If not, then you get into the problem that last name might mean different things in different countries. In most countries I know of, it is a family name. Your last name is the same as at least one of your parents' last name. On Iceland, it is your father's first name, followed by "son" or "daughter". So the same last name will mean completely different things depending on whether you encounter it in Iceland and the US.

In some cultures it is common when getting married, for the woman to take her husband's last name. In other cultures, that's completely optional, or might even work the opposite way.

How can you normalize this? What information would it gain you? If you find someone in your database who has "Smith" as the last word making up their name, what does that tell you? It might not be their family name. It might only be part of the family name. It might be an honorary in some language, but which according to their culture, should be considered part of the name.

You can only normalize data if it follows a common structure.

jalf 2009-04-23 16:42:50

There might be a case where being able to link married/maiden names would be useful.
Recently had a case where I had to rename thousands of emails in exchange because somebody got divorced and didn't want any emails listing her as [email protected]

Martin Beckett 2009-04-23 16:43:51

+2 A:

You often don't go over fourth form normalization in a database. Therefore seventh form normalization is quite a bit overboard. The only place this might even be a remotely plausible idea is in some kind of massive data warehouse.

David 2009-04-23 16:45:03

No need to normalize to that level unless the names make up a composite primary key and you have data that is dependant on one of the names (e.g. anyone with the surname Plummer knows nothing about databases). In which case, by not normalizing, you would violate second normal form.

John Plummer 2009-04-23 16:53:39

I agree with the general response, you wouldn't do that.

One thing comes to mind though, compression. If you had a billion people and you found that 60% of first names were pulled from 5 very common names, you could use some tricky bit manipulation to reduce the size very significantly. It would also require very customized database software.

But this isn't for the purpose of normalization, just compression.

Bill K 2009-04-23 17:11:17

Or simple a custom storage engine.

maxwellb 2010-06-27 18:43:39

>>Or simple a custom storage engine<< which is part of most modern bigger DBMSs today.

TheBlastOne 2010-07-19 16:35:37

You should normalize it out if you need to avoid the delete anomaly that comes with not breaking it out. That is, if you ever need to answer the question, has my database ever had a person named "Joejimbobjake" in it, you need to avoid the anomaly. Soft deletes is probably a much better way than having a comprehensive first name table (for example), but you get my point.

JP Alioto 2009-04-23 17:20:31

In addition to all the points everyone else has made, consider that if you were implementing a data entry operation (for example), and were to insert a new contact, you would have to search your first name and last name tables to locate the correct Id's and then use those values. But then this is further complicated by the occasion when the name is not on the FN and/or LN tables, then you have to insert the new first/last name and use the new id(s).

And if you think that you have a comprehensive list of names, think again. I work with a list of over 200k unique first names and I'd guess it represents 99.9% of the US population. But that .1% = a lot of people. And don't forget the foreign names and misspellings...

Marc Bernier 2009-04-23 19:54:13

ansaurus

tags:

views:

answers:

Is normalizing a person's name going too far?

related questions