How can this method to convert a name to proper case be improved?

+3 A:

I think you'll run up against a wall here, because you usually won't be able to judge correctly whether a conversion is reasonable or not.

Consider your edge cases

JASON MCDONALD -> Jason Mcdonald (Correct: Jason McDonald)

You could simply check for Mc at the beginning of the name and then apply your correction, right? But what if the person is named Mcizck (I made that up, of course), which should not be corrected to McIzck but left as is?

There is no 100% perfect solution to this problem. What you have here is a natural language problem, and those are really difficult to solve, especially for a computer. Cultures are too different to be modeled correctly. Even if you say North American conventions take precedence, you'll have a high percentage of "false positives". Our society consists of a huge mix of cultures; it is simply not adequate to say "North American takes precedence".

Without handling the edge cases, I guess your current solution will work 99% of the time. All further edge cases should be corrected manually if 100% correct names are really required.

Johannes Rudolph 2010-04-30 16:30:34

A:

~~Well first of all, this code will throw an exception if the name has a ' or - at the end since it will try to capitalize the next (non existent) element in the array.~~ edit, see comment below

Other than that...

I don't think you can really account for DiFranco unless you only account for DiFranco and no other Di's (are there any?). Also, I think it's safe to assume that any Mc deserves a capital next letter. And I also think it's safe to say that de and la can be lowercased when they have spaces around them.

But at the end of the day, you seem to be trying to make use of cultures, which indicates to me that perhaps you're not just using English. If this is the case, then I think you're going to have many more problems than you think. If you're only doing English (or this module is the English module and there are others for other languages), then perhaps you're as close as you're going to get (aside from Mc etc.).

statichippo 2010-04-30 16:31:01

DiBella is another 'Di' (fond memories of a girl with that surname from high school ;-) )

DaveDev 2010-04-30 16:33:08

@statichippo I don't think it will cause an exception (just tested it). Notice the for loop condition `i + 1 < chars.Length`, so it will always stop one character back from the end.

Kelsey 2010-04-30 16:34:20

woops, didn't notice that. edited

statichippo 2010-04-30 16:44:36

A:

You could:

Split on your delimiters " ", "," and "-"
Title case each part
Handle all your edge cases for each phrase
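The three steps above could be sketched roughly as follows. This is only an illustration in Python (the question's own code is not shown here); the `EDGE_CASES` table and the `proper_case` name are made up for the example, and the entries in the table are not a vetted list.

```python
import re

# Hypothetical per-part overrides, keyed by the lower-cased part.
# Illustrative entries only; a real table would need to be curated.
EDGE_CASES = {
    "mcdonald": "McDonald",
    "de": "de",
    "la": "la",
}


def proper_case(name: str) -> str:
    """Title-case each part of a name, keeping the original delimiters."""
    # The capturing group makes re.split return the delimiters as well,
    # so the name can be reassembled exactly as it was punctuated.
    pieces = re.split(r"([ ,\-])", name)
    result = []
    for piece in pieces:
        key = piece.lower()
        if key in EDGE_CASES:
            result.append(EDGE_CASES[key])
        else:
            # capitalize() upper-cases the first character and lower-cases
            # the rest; delimiters and empty strings pass through unchanged.
            result.append(piece.capitalize())
    return "".join(result)
```

With this table, `proper_case("OSCAR DE LA HOYA")` lower-cases "de" and "la", which is exactly the kind of rule that will be wrong for someone who writes their name "De La Hoya".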

Mark 2010-04-30 16:31:31

+2 A:

There is no general solution to this problem. Even within the common edge cases like "Mc", there are counterexamples. I had a friend in college with a "Mc" name who didn't capitalize the following character; apparently it was screwed up in immigration generations ago, and they all stick with the on-record-yet-historically-incorrect spelling.

One of my colleagues has a first name made of two traditional first names CamelCased together. You're never going to be able to account for that.

This problem is equivalent to upscaling a video file; you can approximate the best you can but you can't magically generate information that wasn't stored in the first place.

Cory Petosky 2010-04-30 16:32:28

You mean you can't automatically "enhance" and "zoom" 100x into a low quality image like they do on TV?

Nelson 2010-04-30 16:33:47

+1 A:

You can create rules that can get you closer, but you can't get 100%. For example, you can create a list of prefixes (Mc, Di, etc.) and apply rules such as:

If the prefix ends in a vowel and the next letter is a vowel, lowercase.
If the prefix ends in a vowel and the next letter is a consonant, uppercase.
If the prefix ends in a consonant, the next letter is uppercase.

Etc... but you would probably want to obtain a good list of the prefixes and you'll always have exceptions.
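A minimal sketch of those three rules, in Python for illustration. The `PREFIXES` tuple only contains the two prefixes named above, and `apply_prefix_rules` is a hypothetical helper name; a real list would be much longer.

```python
VOWELS = set("aeiou")

# Hypothetical prefix list; only "Mc" and "Di" are named in this answer.
PREFIXES = ("Mc", "Di")


def apply_prefix_rules(word: str) -> str:
    """Apply the vowel/consonant prefix heuristic to a single name part."""
    # Start from plain capitalization: first letter upper, rest lower.
    cased = word.capitalize()
    for prefix in PREFIXES:
        # Only act when at least one character follows the prefix.
        if cased.startswith(prefix) and len(cased) > len(prefix):
            next_letter = cased[len(prefix)]
            rest = cased[len(prefix) + 1:]
            prefix_ends_in_vowel = prefix[-1].lower() in VOWELS
            if prefix_ends_in_vowel and next_letter in VOWELS:
                # Vowel followed by vowel: keep the next letter lower-case.
                next_letter = next_letter.lower()
            else:
                # Vowel followed by consonant, or prefix ending in a
                # consonant: upper-case the next letter.
                next_letter = next_letter.upper()
            return prefix + next_letter + rest
    return cased
```

This turns "MCDONALD" into "McDonald" and "DIFRANCO" into "DiFranco" and leaves "DIAZ" as "Diaz", but it also turns "DIXON" into "DiXon", which is the sort of exception you'll always have.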

Nelson 2010-04-30 16:41:27

A:

The problem is, as everyone else said, that you're never going to catch every edge case. I was going to suggest going here, downloading the full data set and comparing. But that data set is all upper-cased. Since this is a one-time process, I would instead download the list from the aforementioned link that has the top 1000 surnames, manually correct them, and process your records against that list. Then flag the records that were not processed and see if the number is small enough to be manageable by hand.
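Processing against a corrected list and flagging the rest might look something like this sketch (Python, for illustration). `CORRECTED_SURNAMES` stands in for the hand-corrected list and holds only a few made-up sample entries; `process_records` is a hypothetical name.

```python
# Hypothetical stand-in for the manually corrected surname list.
CORRECTED_SURNAMES = ["McDonald", "Smith", "DiFranco"]

# Index the corrected names by their upper-cased form so that the
# all-caps source records can be matched directly.
LOOKUP = {name.upper(): name for name in CORRECTED_SURNAMES}


def process_records(surnames):
    """Return (converted, flagged).

    converted maps each recognized source surname to its corrected casing;
    flagged lists the surnames that were not found, left untouched so they
    can be reviewed by hand.
    """
    converted = {}
    flagged = []
    for surname in surnames:
        corrected = LOOKUP.get(surname.strip().upper())
        if corrected is None:
            flagged.append(surname)
        else:
            converted[surname] = corrected
    return converted, flagged
```

The length of `flagged` after a run tells you whether the remainder is small enough to fix manually.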

Jacob G 2010-04-30 16:43:47

A:

Your question is regarding whether your program can be improved. My response is, "What direction is improvement?" You have two different edge cases that are mutually exclusive. Either you will not catch the people with unusual capitalization rules, or you will not catch the people who do not abide by unusual capitalization rules.

I went to school with someone with a surname of "De La Rosa". Considering your example of de la Hoya, it would be fair to assume that "de la Rosa" is also a surname of someone out there. So if you implement one method to decapitalize "de la", then you miss my friend and I will be sad. And if you don't implement the decapitalization, you miss out on those other people. And heaven forbid you run into some De la Rosa who wouldn't be caught by either method...

So think, what direction do you consider to be "improvement" for your code? If you consider that you should handle edge cases for unusual capitalization and manually account for those who do not abide, the other answers provided will help you along that goal. If you consider that you should manually handle unusual capitalization, then your code needs no change. Either way, you'll have to be manually doing something.

ccomet 2010-04-30 16:51:03

+3 A:

I hope that the reason you're doing this conversion is because the software is changing to allow the users to input their names with the correct casing in the first place.

That said, the only dependable solution would be to notify the users that you have changed the representation of their name. They can then edit the casing if it is incorrect. (You could call them, email them, wait until they use your software the next time, etc.)

If you can't let the users update their own names, the second most dependable method would be to collect lists of (last) names from public sources. If you can find enough of these, you should be able to cover more of the edge cases - simply see if the name exists in your properly-cased list, then use that casing.

John Fisher 2010-04-30 16:53:43

It's a system data migration where the `customers` have no access to this data in either the old or the new system. It's just a batch cleanup of data before importing into the new system.

Kelsey 2010-04-30 17:15:35

+1 The important thing is we should respect the customer's wishes as to how their names are spelled or capitalized.

Jeffrey L Whitledge 2010-04-30 17:39:42

If this is meant to be a batch cleanup, you shouldn't change the case at all. All caps in all cases implies that case information isn't known. Introducing capitalization as relevant actually dirties your data, because you go from 0% capitalization errors to >0% capitalization errors.

Cory Petosky 2010-04-30 19:17:25
