Here's an interesting problem.

I have an oracle database with name & address information which needs to be kept current.

We get data feeds from a number of different gov't sources, and need to figure out which records match, and then whether to update the existing db record or create a new one.

There isn't any sort of unique identifier that can be used to tie records together, and the data quality isn't always that good - there will always be typos, people using different names (e.g. Joe vs. Joseph), etc.

Are there any tools to help with data matching? Because of privacy regulations, we can't send the data out, so it's something we have to do in-house.

If you can't recommend a tool, I'd be interested in hearing from anyone who's worked on this type of problem before about how they solved it, or at least automated parts of it.

+1  A: 

At my current job, we have a lot of data integrity issues. We try to "scrub" data before it is loaded to ensure quality. We currently use Melissa Data to scrub names, addresses and emails. It does a pretty good job standardizing things before we load it. That way, we can avoid duplicate data.

Also, SQL Server Integration Services (I know you are using Oracle) has a component which does "fuzzy" matching on strings, allowing you to find a near match rather than a strict one-to-one match. It is called Fuzzy Lookup. Basically it gives you a score telling you how close the two inputs are to a match... You can then set thresholds to tell it how close it has to be to consider it a match (e.g., within 80%).
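The score-plus-threshold idea behind Fuzzy Lookup can be sketched in plain Python with the standard library's difflib. This is just an illustration of the concept, not Fuzzy Lookup's actual algorithm, and the 0.8 threshold is an arbitrary example:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity score for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat the pair as a match once the score clears the threshold."""
    return similarity(a, b) >= threshold

print(is_match("Joseph Smith", "Joeseph Smith"))  # typo, still a near match
print(is_match("Joseph Smith", "Jane Doe"))       # clearly different
```

In practice you would tune the threshold against a sample of known matches and non-matches, since the right cutoff depends heavily on your data.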

In the past, I have also used SOUNDEX to determine if names were similar in determining duplicates. There are lots of SOUNDEX implementations.
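For illustration, here is a sketch of classic American Soundex (first letter plus three digits coding the consonants) in pure Python; it is a minimal textbook version, not any particular vendor's implementation:

```python
def soundex(name: str) -> str:
    """American Soundex: first letter + three digits coding the consonants."""
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}

    def code(ch):
        for letters, digit in codes.items():
            if ch in letters:
                return digit
        return ""  # vowels and h, w, y get no code

    name = name.lower()
    result = name[0].upper()
    prev = code(name[0])
    for ch in name[1:]:
        digit = code(ch)
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w don't separate duplicate codes
            prev = digit
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both R163
print(soundex("Joe"), soundex("Joseph"))     # J000 vs J210 - Soundex misses this pair
```

Note the last example: Soundex gives Joe and Joseph different codes, which is exactly the kind of name-variant pair the question asks about.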

Good luck.

Doanair
Soundex is very bad; if you want to go the phonetic path, you should do something along the lines of Double Metaphone...
Unreason
+1  A: 

The term for the problem you're looking to solve is "Record Linking".

I cannot recommend a specific tool. Do a search for "record linking software" or "merge purge software" and with a little research you should be able to find something that meets your needs.

I was able to find one open source solution, named Febrl. This tool was originally developed for the realm of biomedical research, which explains the name "Freely extensible biomedical record linkage." You can read more about it here: "Parallel Large Scale Techniques for High-Performance Record Linkage"

For a good overview of the problems/solutions involved in record linking read about "Merge/Purge and Duplicate Detection".

Robert
+2  A: 

Each of the major software companies that are active in this space offer solution suites that handle name and address parsing, data standardization, record deduplication or matching, record linking/merging, survivorship, and so on. They're all a bit pricey, though.

For example, Oracle's own solution for this problem is the product "Oracle Data Quality (ODQ) for Oracle Data Integrator (ODI)," which is part of their Fusion Middleware stack. As the name implies, ODQ requires ODI (i.e., it is an add-on module that is licensed separately and is dependent on ODI).

IBM's WebSphere solution suite (obtained through their Ascential acquisition) includes QualityStage.

Business Objects, now an SAP company, has a Data Quality product under its Enterprise Information Management (EIM) suite.

Other major data quality brands include DataFlux (a SAS company) and Trillium Software (a Harte-Hanks company).

The Gartner Group releases an annual Magic Quadrant for data quality solution suites. Vendors who rate well in these Magic Quadrants usually make the entire report available online to registered users on their website (example 1, example 2).

mdy
A: 

I did something like this on membership enrollment lists for an insurance company. Luckily we had SSNs to determine the primary member, but I had to decide if dependents were new or existing (by name alone). I tried a soundex algorithm as part of a solution, but it didn't seem to fit the bill, because it seemed to map too many different names to the same thing (and many families tend to name their kids with similar-sounding names). I ended up just using (I think) the first four letters of the dependent's first name as a unique identifier, and calling it "good enough." I don't know how I'd handle George Foreman's family :-)

runrig
+1  A: 

I've worked on this very problem, and analyzed five vendors before selecting one. Here's what I learned. It was five years ago, so some of these companies have been bought by other companies...

  1. At the top tier, all the solutions are good and work about the same. The top vendors are FirstLogic and Trillium. Both FirstLogic and Trillium sold add-ins that plugged into Informatica as transformations.

  2. Group1 was OK, but their demo was based on a CICS application and the whole presentation seemed very unpolished.

  3. DataFlux looked really good, but under the covers it wasn't as powerful or full-featured. I expect that they've made the most progress over the last five years, though.

  4. We never seriously considered Oracle because we were ticked off at the last expensive Oracle product we had purchased.

We selected those companies based on Gartner. There were about eight more that we didn't look at. We ended up going with FirstLogic over Trillium because (publicly) they had better Informatica integration. But (privately) I trusted the FirstLogic sales team more.

I'll put my lessons learned in a separate post.

JPLemme
+1  A: 

Lessons Learned from using FirstLogic. (Other products should be similar.) For context, we were getting files from dozens of sources, and each would be in a different format. We needed to know who needed to be added to our database, who needed to be updated, and who just needed to be marked as "on the list".

  1. I expected the software would look at an entire file and determine that--say--if column A had last names in 98 out of 100 rows then column A must be the last name column. That's not true; each record is processed individually.

  2. If you know exactly what data is in what field then you can tell the software that and it will handle it. But if you're not always sure you're better off just giving it the whole string and letting the software figure it out. An example of this is a last name field that contains something like "Smith Jr MD". If you declare it to be the last name column it will assume the last name is "Smith Jr MD". But if you pass "John Smith Jr MD" and let the software figure it out it will correctly identify all the bits.

  3. Some things that seemed obvious were not. For example, out of the box FirstLogic doesn't assume that a comma means "last name, first name". It actually has a great big list of last names and a great big list of first names, so it figures it can just ignore the comma. We consistently had issues with people like "John, Thomas". And sometimes it would get confused because what looked like an obvious last name to us was actually a first name in Spanish or something.

  4. It's hard to use the system to add dummy data. If you have people named things like "Test Account" or "TBD" they won't be processed at all--FirstLogic will throw the record away because it didn't find any bits of data it recognized.

  5. Customizing the system is possible, but not as easy as the sales people make it sound. There are a million options and custom files and dictionaries. You'll need to invest the time to understand how it works if you expect to be able to customize it. It's like an RDBMS or an ETL tool. It's not a magic black box.

  6. It also has a ton of data quality features that help to justify buying the software but that require a dedicated effort to learn and apply.

  7. The tools aren't really designed to process a list against a master file; they were created for merge-purge operations. It's possible (we're doing it), but it requires that you treat your database as a list (which requires you to extract all the live records into a flat file). Also, the more control you want over which records go where (e.g., if the name is an 80% match and the first two digits of the zip code are identical, then it's probably the same person), the more complicated your batch process is going to be.

  8. Finally, we realized that processing "John Smith" and processing "Annabelle DiGiovanni" are very different beasts when you're trying to determine if two people at different addresses are actually the same. You can fine-tune the software so that uncommon names are given more weight on a match than common names, but who has the time? Our matching process is about 80% accurate and we were satisfied with manually processing the remaining 20%.
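The kind of composite rule mentioned in point 7 (name similarity plus a partial zip-code check) can be sketched in a few lines of Python. The 0.8 threshold, the two-digit zip prefix, and the field names are illustrative assumptions, not FirstLogic's actual logic:

```python
from difflib import SequenceMatcher

def probably_same_person(rec_a: dict, rec_b: dict) -> bool:
    """Illustrative composite rule: roughly 80% name similarity AND
    the first two digits of the zip code identical."""
    name_score = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
    same_region = rec_a["zip"][:2] == rec_b["zip"][:2]
    return name_score >= 0.8 and same_region

a = {"name": "John Smith Jr", "zip": "60614"}
b = {"name": "John Smith", "zip": "60601"}
print(probably_same_person(a, b))  # names ~87% similar, same zip prefix
```

Every extra field you fold into a rule like this (zip, birth date, phone) multiplies the combinations your batch process has to handle, which is exactly the complexity point 7 warns about.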

Don't think of it as a tool to automatically do the matching. Think of it as a tool to allow your human data processors to be more productive. That way you've set it up so that if you hit 80% accuracy it's a resounding success rather than short of the mark.

JPLemme
A: 

I have seen DataFlux in action, and, according to our "star" data analyst, it is the most accurate tool for "clustering" (as DataFlux calls it) that he has found for linking names of people and companies. It is pretty expensive, though.

luiscolorado
A: 

It's a classic problem of record linkage. Typos, missing letters, etc. can be handled with a probabilistic approach by defining a set of blocking rules.
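Blocking just means comparing records only within cheaply-computed groups instead of comparing every pair. A minimal sketch (the blocking key here, first letter of the surname plus the first two zip digits, is an arbitrary example):

```python
from collections import defaultdict
from itertools import combinations

def block_key(record: dict) -> str:
    # Example blocking key: first letter of surname + first two zip digits.
    return record["last"][0].upper() + record["zip"][:2]

def candidate_pairs(records):
    """Yield only pairs sharing a blocking key, instead of all N*(N-1)/2."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)

people = [
    {"last": "Smith", "zip": "60614"},
    {"last": "Smyth", "zip": "60601"},  # same block as Smith
    {"last": "Jones", "zip": "10001"},
]
pairs = list(candidate_pairs(people))
print(len(pairs))  # only the Smith/Smyth pair survives blocking
```

The expensive fuzzy comparison then runs only on the surviving candidate pairs, which is what makes probabilistic linkage feasible on large files.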

One piece of software that could do it nicely for you is FRIL. It's written in Java, free, and very friendly. There are links to tutorials and videos on the website. It supports DB connectivity, although I'm not sure about Oracle specifically.

RecLink could be another way to go, although you would probably have to work harder to have it up and running since there are no manuals/tutorials (to my knowledge).

Among the free options, there is also RELAIS.

radek