views:

207

answers:

3

I'm converting several large ETL processes from SSIS over to Rhino-Etl. I got fed up with the "grahical programming" nature of SSIS. Hunting through hundreds of dialog boxes to figure out what the ETL process is doing gets old in a hurry.

Several of the SSIS packages use the Fuzzy Match component for joining user accounts based on first name, last name, and DOB. The fuzzy match is necessary because the names are not always spelled the same across the two systems. There might be a legal name in one and a nick name in the other, typos, marriage name changes, etc.

The SSIS fuzzy lookup component does an OK job of handling these. It's not perfect, but it still greatly reduces the number of records that require a human's help.

Neither .Net nor Rhino-Etl have anything like this out of the box, so I'm looking for suggestions.

I have tried using the SoundEx t-sql function, but it doesn't produce very good results.

I'm currently considering using Levenshtein or possibly a Lucene.Net index.

Does anyone have any advice on either of those? Please feel free to suggest something completely different as well.

A: 

We use Levenshtein to find similar strings in our application code (actually a database that contains all the strings). It works quite good, provided that you play a bit with the distance, e.g. allow longer strings to have more differences.

What Levenshtein doesn't do is handle synonyms, like "child" and "kid" or "car" and "wagon". In your case you probably also want to handle synonyms like "Bill" and "William", so you will probably have to add this yourself.

Patrick
A: 

Have a look at www.match-logics.com It offers pure .NET solution with c# API to make fuzzy search requests in SQL Server 2005/2008.

paul
+1  A: 

Have a look at open source project Sam's String Metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html.

I answered a similar question here: http://stackoverflow.com/questions/83777/are-there-any-fuzavvvy-search-or-string-similarity-functions-libraries-written-for-c/1095806#1095806

I would also do some research on Double Metaphone http://en.wikipedia.org/wiki/Double_Metaphone

Zaffiro
I haven't actually dug into this library, but it seems like it would meet the requirement. It's unfortunately written in Java, not .Net.
Craig Quillen
There is a .net version. Its on the download page
Zaffiro