I have multiple instances of person entities that are often the same person. Where the First-Last name is the same at the same address, it's a no-brainer to merge/rollup them.

However, due to data entry inconsistencies, there must be a way to relax the exact-match requirement a bit. I think the credit card industry does this to some extent: zip plus street number, or street name? ...something of that nature.

In order to solidify my matching, I cleaned up the address strings, trying to make them as standard as possible ("Hwy" --> "Highway", etc.).

I need something that will still match records that look like obvious matches to a person glancing at them, but that fail to have exactly matching data.

Here is my initial thought, concatenate a string made up of the following:

First Initial
LEFT8 of the LastName (allows inconsistent endings, such as "Esq." or "CPA")
LEFT3 of Zip
Street Number
LEFT8 of the StreetName (not Addr1 -- "Oak" for "8 N Oak Street")
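
As a minimal sketch of the key described in the list above, assuming the five fields have already been parsed out of the record: the function name `match_key`, its parameter names, the `|` separator, and the step that strips everything except letters and digits are all illustrative assumptions, not part of the original scheme.

```python
import re


def match_key(first, last, zip_code, street_number, street_name):
    """Build a loose match key from already-parsed name and address fields."""

    def clean(value):
        # Assumption: upper-case and drop everything except letters and digits,
        # so "Smith-son" and "Smithson" compare equal.
        return re.sub(r"[^A-Z0-9]", "", value.upper())

    parts = [
        clean(first)[:1],         # first initial
        clean(last)[:8],          # LEFT8 of the last name
        clean(zip_code)[:3],      # LEFT3 of the zip
        clean(street_number),     # full street number
        clean(street_name)[:8],   # LEFT8 of the street name
    ]
    return "|".join(parts)


# Two records that differ only in trailing noise produce the same key.
key_a = match_key("John", "Smithson Esq.", "90210-1234", "8", "Oak")
key_b = match_key("J.", "Smithson", "90211", "8", "Oak")
```

Records sharing a key would then be candidates for a merge rather than automatic merges, which leaves room for a manual check on borderline cases.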

Did I miss something here? I think I made it loose enough to overcome normal data entry inconsistencies, but tight enough to avoid incorrect matches.

A: 

I was involved in a project to clean up name and address data for a large financial institution. We achieved an automatic success rate of about 98.4%, but unfortunately this still left about 150,000 mismatches.

The way we attacked the problem was to build up (over time) a rule base of the types of errors that could occur, and to extend the fuzziness of the logic to cover each identified class of error.

A significant amount of data cleansing can indeed be done by reference to (UK) post codes and house number and/or name. In the UK fuzziness can be introduced by consideration of the first part of the post code - which determines a wide area. I'm not clear whether the same applies to zip codes.
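
A rough sketch of using the first part of a UK post code to group candidate records, assuming the post code is otherwise well formed (the inward part of a UK post code is always its last three characters). The helper names `outward_code` and `group_by_outward_code` and the record layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def outward_code(postcode):
    """Return the first (wide-area) part of a UK post code."""
    compact = "".join(postcode.split()).upper()
    # The inward part is the last three characters; the rest is the outward part.
    return compact[:-3] if len(compact) > 3 else compact


def group_by_outward_code(records):
    """Group (record_id, postcode) pairs by outward code."""
    groups = defaultdict(list)
    for record_id, postcode in records:
        groups[outward_code(postcode)].append(record_id)
    return dict(groups)


sample = [(1, "SW1A 1AA"), (2, "sw1a2aa"), (3, "M1 1AE")]
groups = group_by_outward_code(sample)
```

Only records that fall in the same group would then be compared in detail on house number and/or name.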

However this approach does not deal well with addresses that are out of the normal run - my own address is an example; I live on a boat, and as a consequence have some additional pieces of address in order to ensure correct addressing.

Anomalies of this sort are always likely to need manual intervention.

Incidentally, your assertion that it's a no-brainer to merge/rollup people whose First-Last is the same at the same address needs to be challenged. The most difficult cases we had in data cleansing were precisely those where two people of the same name (e.g. father and son) lived at the same address. Equally, if somebody of the same name buys the property (which happens), then again there are problems of "re-duplication".

Chris Walton
Considering the birth date may be helpful
Agos
You mentioned problems with esq, etc. Part of the rule base we had to build up consisted of rules about how a person is referred to. This part of the rule base ran to about 180 rules dealing with titles (Dr, Rev, etc), multiply hyphenated names (ffoulkes-symthe), qualifications (CPA, BSc, etc), and familial qualifications (the third, etc). This was in addition to multiple modes of naming depending on the person's native culture (people from the Indian subcontinent often do not have a family name, and each will apparently have a different surname).
Chris Walton
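
To illustrate the kind of rule described in the comment above, here is a toy sketch that strips titles and qualifications from a name but keeps any generational suffix separately, so that a father and son are not silently collapsed. The token lists are tiny made-up samples and the function name `split_name` is hypothetical; a real rule base would be far larger.

```python
import re

# Assumption: small illustrative samples only, not a real rule base.
TITLES = {"DR", "REV", "MR", "MRS", "MS"}
QUALIFICATIONS = {"CPA", "BSC", "ESQ"}
GENERATIONAL = {"JR", "SR", "II", "III"}


def split_name(name):
    """Return (core_name, generational_suffix) with titles and qualifications removed."""
    raw_tokens = name.replace(",", " ").split()
    # Drop punctuation such as the dots in "Dr." or "B.Sc." but keep hyphens.
    tokens = [re.sub(r"[^A-Za-z\-]", "", token) for token in raw_tokens]
    core = []
    suffix = ""
    for token in tokens:
        upper = token.upper()
        if not token or upper in TITLES or upper in QUALIFICATIONS:
            continue
        if upper in GENERATIONAL:
            suffix = upper
            continue
        core.append(token)
    return " ".join(core), suffix


plain = split_name("Dr. John Smith, CPA")
junior = split_name("John Smith Jr.")
```

Keeping the suffix as its own field means two records can share a core name yet still be flagged as possibly different people.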
@Agos - birth date is indeed helpful, but can lead to privacy issues, and in a lot of circumstances may not be recorded (in much of Europe anyway).
Chris Walton
Yes, Chris, in the USA the Left3 of our Zip Code IDs a wider area. Good point on the father/son unintentional merge. In my data set, all of the person entities are officers/owners of corporations which lends a greater degree of uniqueness to otherwise common names (especially when paired with geo data).
Chris Adragna
@Agos, we don't have birthdate unfortunately. It sure would be easier with it -- even just part of it. :)
Chris Adragna