tags:

views:

30

answers:

2

Converting a database of people and addresses from ALL CAPS to Title Case will create a number of improperly capitalized words/names, some examples follow:

MacDonald, PhD, CPA, III

Does anyone know of an existing script that will cleanup all the common problem words? Certainly, it will still leave some mistakes behind (less common names with CamelCase-like spellings, i.e. "MacDonalz").

I don't think it matters much, but the data currently resides in MSSQL. Since this is a one-time job, I'd export out to text if a solution requires it.

There is a thread that posed a related question, sometimes touching on this problem, but not addressing this problem specifically. You can see it here:

http://stackoverflow.com/questions/230138/sql-server-make-all-upper-case-to-proper-case-title-case

Thank you. Chris Adragna Orlando, FL

+2  A: 

Don't know if this is of any help

private static function ucNames($surname) {
//  ( O\' | \- | Ma?c | Fitz )  # attempt to match Irish, Scottish and double-barrelled surnames
    $replaceValue = ucwords($surname);
    return preg_replace('/
                        (?: ^ | \\b )       # assertion: beginning of string or a word boundary
                        ( O\' | \- | Ma?c | Fitz )  # attempt to match Irish, Scottish and double-barrelled surnames
                        ( [^\W\d_] )        # match next char; we exclude digits and _ from \w
                        /xe',
                        "'\$1' . strtoupper('\$2')",
                        $replaceValue);
}

It's a simple PHP function that I use to set surnames to correct case that works for names like O'Connor, McDonald and MacBeth, FitzPatrick, and double-barrelled names like Hedley-Smythe

Mark Baker
That's helpful. A large part of the problem is just that, and some of the other things to chase down are easier to find/replace such as III, CPA, MD, etc.
Chris Adragna
A: 

Here is the answer I was looking for:

There is a data company, Melissa Data, who publishes some API and applications for database cleanup -- geared mostly around the direct marketing industry.

I was able to use two applications to solve my problem.

  1. StyleList: this app, among other things, converts ALL CAPS to mixed case and in the process it does not dirty up the data, leaving titles such as CPA, MD, III, etc. in tact; as well as natural, common camel-case names such as McDonalds.
  2. Personator: I used personator to break the Full Name fields into Prefix, First Name, Middle Name, Last Name, and Suffix. To be honest, it was far from perfect, but the data I gave it was pretty challenging (often no space separating a middle name and a suffix). This app does a number of other usefult things as well, including assigning gender to most names. It's available as an API you can call, too.

Here is a link to the solutions offered by Melissa Data:

http://www.melissadata.com/dqt/index.htm

For me, the Melissa Data apps did much of the heavy lifting and the remaining dirty data was identifiable and fixable in SQL by reporting on LEFT x or RIGHT x counts -- the dirt typically has the least uniqueness, patterns easily discovered and fixed.

Chris Adragna
Is this a sales pitch?
Mr. Matt
Oh, no, certainly not. I understand your asking. The solutions from the Melissa Data company include API (or just apps). For my specific needs it saved an awful lot of time by not only *fixing* data I broke by changing case, but *finding* it, even. It otherwise would require a lot of record by record inspection. Consequently, though, I did have to still do much manipulation post-cleanup due to extremely dirty data from public records. (absence of delimeters, etc.)
Chris Adragna