ansaurus

Question

What logic to use to rollup/merge multiple person entities as the same? (tight, but fuzzy enough to broaden matches)

Answer 1

A:

I was involved in a project to clean up name and address data for a large financial institution. We achieved an automatic success rate of about 98.4%, but unfortunately this still left about 150,000 mismatches.

The way we attacked the problem was to build up (over time) a rule base of the types of errors that could occur, and to extend the fuzziness of the logic to cover each identified class of error.

A significant amount of data cleansing can indeed be done by reference to (UK) post codes combined with the house number and/or house name. In the UK, fuzziness can be introduced by considering only the first part of the post code, which identifies a wider area. I'm not clear whether the same applies to zip codes.
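As a rough illustration of the post code idea above, here is a minimal Python sketch that builds a tight key (full post code plus house) and a looser key (first part of the post code plus house). The function names `outward_code` and `match_keys` are made up for this example and are not from the project described; the only fact relied on is that the second part of a UK post code is always three characters.

```python
import re
from typing import Tuple


def outward_code(postcode: str) -> str:
    """Return the first part of a UK post code, e.g. 'SW1A' from 'sw1a 1aa'."""
    compact = re.sub(r"\s+", "", postcode).upper()
    # The second (inward) part of a UK post code is always three characters.
    return compact[:-3] if len(compact) > 3 else compact


def match_keys(postcode: str, house: str) -> Tuple[str, str]:
    """Return (tight_key, loose_key) for grouping candidate duplicates."""
    compact = re.sub(r"\s+", "", postcode).upper()
    house_norm = re.sub(r"\s+", " ", house).strip().upper()
    tight = f"{compact}|{house_norm}"                  # exact post code + house
    loose = f"{outward_code(postcode)}|{house_norm}"   # wider area + house
    return tight, loose
```

Records sharing a tight key are strong candidates; records sharing only the loose key would need further checks (name similarity, for instance) before being treated as the same person.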

However this approach does not deal well with addresses that are out of the normal run - my own address is an example; I live on a boat, and as a consequence have some additional pieces of address in order to ensure correct addressing.

Anomalies of this sort are always likely to need manual intervention.

Incidentally, your assertion that it's a no-brainer to merge/rollup people whose First-Last is the same at the same address needs to be challenged. The most difficult cases we had in data cleansing were precisely those where two people (e.g. father and son) of the same name lived at the same address. Equally, if somebody of the same name bought a property (which happens), then again there are problems of "re-duplication".
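One way to act on this caution, sketched in Python under stated assumptions: treat an identical name and address as a candidate match only, and merge automatically only when some other field corroborates it. The record layout (`name`, `address`, `suffix`, `email`) and the function `decide` are hypothetical and chosen for illustration; this is not the rule base described above.

```python
def decide(a: dict, b: dict) -> str:
    """Return 'no_match', 'review', or 'merge' for two person records."""
    same_name = a["name"].casefold() == b["name"].casefold()
    same_addr = a["address"].casefold() == b["address"].casefold()
    if not (same_name and same_addr):
        return "no_match"

    # Conflicting generational suffixes (e.g. Jr vs Sr) suggest two people.
    if a.get("suffix", "") != b.get("suffix", ""):
        return "review"

    # Assumed corroborating field: merge only if both records carry the
    # same non-empty email; otherwise leave the pair for manual review.
    email_a, email_b = a.get("email", ""), b.get("email", "")
    if email_a and email_a.casefold() == email_b.casefold():
        return "merge"
    return "review"
```

The design choice here is simply that a same-name, same-address pair defaults to manual review rather than to a merge, which costs reviewer time but avoids silently combining a father and son.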

Chris Walton 2010-10-19 15:59:32

Considering the birth date may be helpful.

Agos 2010-10-19 16:08:36

You mentioned problems with esq, etc. Part of the rule base we had to build up consisted of rules about how a person is referred to. This part of the rule base ran to about 180 rules dealing with titles (Dr, Rev, etc.), multiply hyphenated names (ffoulkes-smythe), qualifications (CPA, BSc, etc.), and familial qualifications (the third, etc.). This was in addition to multiple modes of naming depending on the person's native culture (people from the Indian subcontinent often do not have a family name, so members of the same family will apparently each have a different surname).
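To give a flavour of what a few such rules might look like, here is a small Python sketch that strips titles and qualifications and separates out a generational suffix before names are compared. The word lists are tiny, illustrative stand-ins (nothing like the roughly 180 rules mentioned), and `normalise_name` is a hypothetical helper name.

```python
# Illustrative word lists only; a real rule base would be far larger.
TITLES = {"dr", "rev", "mr", "mrs", "ms"}
QUALIFICATIONS = {"cpa", "bsc", "esq"}
GENERATIONAL = {"jr", "sr", "ii", "iii"}


def normalise_name(raw: str) -> dict:
    """Split a raw name into a comparable core name and a generational suffix."""
    tokens = [t.strip(".,").lower() for t in raw.split()]
    tokens = [t for t in tokens if t]
    suffix = [t for t in tokens if t in GENERATIONAL]
    noise = TITLES | QUALIFICATIONS | GENERATIONAL
    core = [t for t in tokens if t not in noise]
    # Hyphenated surnames stay as a single token because we split on spaces.
    return {"name": " ".join(core), "suffix": suffix[0] if suffix else ""}
```

For example, `normalise_name("Dr. John ffoulkes-Smythe, CPA")` gives the name `john ffoulkes-smythe` with an empty suffix. Because this sketch drops matching tokens wherever they appear, it would mangle a name that legitimately contains one of them; position-aware rules would be needed for that.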

Chris Walton 2010-10-19 16:09:46

@Agos - birth date is indeed helpful, but can lead to privacy issues, and in a lot of circumstances may not be recorded (in much of Europe anyway).

Chris Walton 2010-10-19 16:17:26

Yes, Chris, in the USA the first three digits (Left3) of our Zip Code identify a wider area. Good point on the father/son unintentional merge. In my data set, all of the person entities are officers/owners of corporations, which lends a greater degree of uniqueness to otherwise common names (especially when paired with geo data).

Chris Adragna 2010-10-19 16:25:35

@Agos, we don't have birthdate unfortunately. It sure would be easier with it -- even just part of it. :)

Chris Adragna 2010-10-19 16:27:08
