views:

522

answers:

5

Are there any tools or methods that can be used for matching by a person's name between two different data sources?

The systems have no other common information and the names have been entered differently in many cases.

Examples of non-exact matches:

King Jr., Martin Luther = King, Martin (exclude suffix)
Erving, Dr. J. = Erving, J. (exclude prefix)
Obama, Barak Hussein = Obama, Barak (exclude middle name)
Pufnstuf, H.R. = Pufnstuf, Haibane Renmei (match abbreviations)
Tankengine, Thomas = Tankengine, Tom (match common nicknames)
Flair, Rick "the Natureboy" = Flair, Natureboy (match on nickname)

+1  A: 

I often employ soundex-type algorithms for this type of situation. Try the Double Metaphone algorithm. If you are using SQL Server, there is some source code to create a user defined function.

Because you have transposed data, you may want to normalize it a bit, e.g., remove all commas and the sort resulting words by first letter. That will give you some better matching potential. In the case where words have been added in the middle, it gets a bit tougher. You could consider breaking a name into words, checking with Double Metaphone whether there is a word in the other column that matches, and then collect the overall count of matches vs. words, which will tell you how close the two columns are.

I would also filter out common words like Dr., Mr., Ms., Mrs., etc., before doing the comparisons.

RedFilter
+1  A: 

Here are some options:

Phonetic algorithms...

Soundex (http://en.wikipedia.org/wiki/Soundex)

Double Metaphone (http://en.wikipedia.org/wiki/Double_Metaphone)

Edit Distance (http://en.wikipedia.org/wiki/Levenshtein_distance)

Jaro-Winkler Distance (http://en.wikipedia.org/wiki/Jaro-Winkler_distance)

Another thing you could try would be to compare each word (splitting on space and maybe hyphen) with each word in the other name and see how many words match up. Maybe combine this with phonetic algorithms for more fuzzy matching. For a huge data set, you would want to index each word and match it with a name id. For abbreviation matching you could compare just the first letter. You probably want to ignore anything but letters when you compare words as well.

Many of the phonetic algorithms have open source / samples online.

John JJ Curtis
+3  A: 

I had to use a variety of techniques suggested. Thanks pointing me in the right direction(s). Hopefully, the following will help someone else out with this type of problem to solve.

Removing excess characters

CREATE FUNCTION [dbo].[fn_StripCharacters]
(
    @String NVARCHAR(MAX), 
    @MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    SET @MatchExpression =  '%['+@MatchExpression+']%'

    WHILE PatIndex(@MatchExpression, @String) > 0
        SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')

    RETURN @String

END

Usage:

--remove all non-alphanumeric and non-white space  
dbo.fn_StripCharacters(@Value, , '^a-z^0-9 ')

Split name into parts

CREATE FUNCTION [dbo].[SplitTable] (@sep char(1), @sList StringList READONLY)
RETURNS @ResultList TABLE 
    (
     [ID] VARCHAR(MAX),
     [Val] VARCHAR(MAX)
    )
AS
BEGIN

declare @OuterCursor cursor
declare @ID varchar(max)
declare @Val varchar(max)

set @OuterCursor = cursor fast_forward for (SELECT * FROM @sList) FOR READ ONLY

open @OuterCursor

fetch next from @OuterCursor into @ID, @Val

while (@@FETCH_STATUS=0)
begin

    INSERT INTO @ResultList (ID, Val) 
    select @ID, split.s from dbo.Split(@sep, @Val) as split 
           where len(split.s) > 0

    fetch next from @OuterCursor into @ID, @Val
end

close @OuterCursor
deallocate @OuterCursor 

CREATE FUNCTION [dbo].[Split] (@sep char(1), @s varchar(8000))
RETURNS table
AS
RETURN (
    WITH Pieces(pn, start, stop) AS (
      SELECT 1, 1, CHARINDEX(@sep, @s)
      UNION ALL
      SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
      FROM Pieces
      WHERE stop > 0
    )
    SELECT pn,
      LTRIM(RTRIM(SUBSTRING(@s, start, 
             CASE WHEN stop > 0 
                  THEN stop-start 
                  ELSE 8000 
             END))) AS s
    FROM Pieces
  )

RETURN

Usage:

--create split name list
DECLARE @NameList StringList 

INSERT INTO @NameList (ID, Val)
SELECT id, firstname FROM dbo.[User] u
WHERE PATINDEX('%[^a-z]%', u.FirstName) > 0 

----remove split dups
select u.ID, COUNT(*)
from dbo.import_SplitTable(' ', @NameList) splitList
INNER JOIN dbo.[User] u
ON splitList.id = u.id

Common nicknames:

I created a table based on this list and used it to join on common name equivalents.

Usage:

SELECT u.id
, u.FirstName
, u_nickname_maybe.Name AS MaybeNickname
, u.LastName
, c.ID AS ContactID from
FROM dbo.[User] u 
INNER JOIN nickname u_nickname_match
ON u.FirstName = u_nickname_match.Name
INNER JOIN nickname u_nickname_maybe
ON u_nickname_match.relatedid = u_nickname_maybe.id
LEFT OUTER JOIN
(
    SELECT c.id, c.LastName, c.FirstName, 
         c_nickname_maybe.Name AS MaybeFirstName
    FROM dbo.Contact c
    INNER JOIN nickname c_nickname_match
    ON c.FirstName = c_nickname_match.Name
    INNER JOIN nickname c_nickname_maybe
    ON c_nickname_match.relatedid = c_nickname_maybe.id
    WHERE c_nickname_match.Name <> c_nickname_maybe.Name
) as c
ON c.AccountHolderID = ah.ID 
       AND u_nickname_maybe.Name = c.MaybeFirstName AND u.LastName = c.LastName
WHERE u_nickname_match.Name <> u_nickname_maybe.Name

Phonetic algorithms (Jaro Winkler):

The amazing article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library into SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms. I ended up mostly using Jaro Winkler to match the names.

Usage:

SELECT
u.id AS UserID
,c.id AS ContactID
,u.FirstName
,c.FirstName 
,u.LastName
,c.LastName
,maxResult.CombinedScores
 from
(
    SELECT
      u.ID
    , 
     max(
      dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))  
      * dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
     ) AS CombinedScores
    FROM dbo.[User] u, dbo.[Contact] c
    WHERE u.ContactID IS NULL
    GROUP BY u.id
) AS maxResult
INNER JOIN dbo.[User] u
ON maxResult.id  = u.id
INNER JOIN dbo.[Contact] c
ON maxResult.CombinedScores = 
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName)) 
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
Even Mien
This must be the most comprehensive answer on SO - it's a problem I've solved in code but never in pure Sql - very elegant.
MrTelly
@MrTelly Thanks. I did have to resort to the CLR for the SimMetrics Library, though. Hopefully, this will save someone else some pain on their conversion project.
Even Mien
A: 

thanks so much this is enormously helpful.

Angela
+1  A: 

It's a very complex problem - and there are a lot of expensive tools to do it correctly.
If you ever wondered why you can't check in on a flight as Tom, Dick or Harry (or Bill)
Or why no-fly lists and terrorists watch lists don't work -consider:

(1) Muammar Qaddafi
(2) Mo'ammar Gadhafi
(3) Muammar Kaddafi
(4) Muammar Qadhafi
(5) Moammar El Kadhafi
(6) Muammar Gadafi
(7) Mu'ammar al-Qadafi
(8) Moamer El Kazzafi
(9) Moamar al-Gaddafi
(10) Mu'ammar Al Qathafi
(11) Muammar Al Qathafi
(12) Mo'ammar el-Gadhafi
(13) Moamar El Kadhafi
(14) Muammar al-Qadhafi
(15) Mu'ammar al-Qadhdhafi
(16) Mu'ammar Qadafi
(17) Moamar Gaddafi
(18) Mu'ammar Qadhdhafi
(19) Muammar Khaddafi
(20) Muammar al-Khaddafi
(21) Mu'amar al-Kadafi
(22) Muammar Ghaddafy
(23) Muammar Ghadafi
(24) Muammar Ghaddafi
(25) Muamar Kaddafi
(26) Muammar Quathafi
(27) Muammar Gheddafi
(28) Muamar Al-Kaddafi
(29) Moammar Khadafy
(30) Moammar Qudhafi
(31) Mu'ammar al-Qaddafi
(32) Mulazim Awwal Mu'ammar Muhammad Abu Minyar al-Qadhafi

And that's just official spellings - it doesn't include typos!

Martin Beckett