tags:

views:

257

answers:

5

I would like to know if there is a way to match people between two separate systems, using (mostly) SQL.

We have two separate Oracle databases where people are stored. There is no link between the two (i.e. cannot join on person_id); this is intentional. I would like to create a query that checks to see if a given group of people from system A exists in system B.

I am able to create tables if that makes it easier. I can also run queries and do some data manipulation in Excel when creating my final report. I am not very familiar with PL/SQL.

In system A, we have information about people (name, DOB, soc, sex, etc.). In system B we have the same types of information about people. There could be data entry errors (person enters an incorrect spelling), but I am not going to worry about this too much, other than maybe just comparing the first 4 letters. This question deals with that problem more specifically.

They way I thought about doing this is through correlated subqueries. So, roughly,

select a.lastname, a.firstname, a.soc, a.dob, a.gender
  case 
    when exists (select 1 from b where b.lastname = a.lastname) then 'Y' else 'N'
  end last_name,
  case 
    when exists (select 1 from b where b.firstname = a.firstname) then 'Y' else 'N'
  end first_name, 
  case [etc.]
from a

This gives me what I want, I think...I can export the results to Excel and then find records that have 3 or more matches. I believe that this shows that a given field from A was found in B. However, I ran this query with just three of these fields and it took over 3 hours to run (I'm looking in 2 years of data). I would like to be able to match on up to 5 criteria (lastname, firstname, gender, date of birth, soc). Additionally, while soc number is the best choice for matching, it is also the piece of data that tends to be missing the most often. What is the best way to do this? Thanks.

A: 

Are there indexes on all of the columns in table b in the WHERE clause? If not, that will force a full scan of the table for each row in table a.

DCookie
a) no WHERE clause b) I understand that the way I have it now is very slow, so was looking for other suggestions. I'm very new here and not entirely familiar with the database design. Thanks!
Julie
Yes, there is a WHERE clause in each of your CASE expression SELECT statements.
DCookie
b) If you index your columns in table b appropriately, the query should speed up considerably.
DCookie
+2  A: 

I would probably use joins instead of correlated subquires but you will have to join on all the fields, so not sure how much that might improve things. But since correlated subqueries often have to evaluate row-by-row and joins don;t it could improve things a good bit if you have good indexing. But as with all performance tuning only trying the techinque will let you knw ofor sure.

I did a similar task looking for duplicates in our SQl server system and I broke it out into steps. So first I found everyone where the names and city/state were an exact match. Then I looked for additional possible matches (phone number, ssn, inexact name match etc. AS I found a possible match between two profiles, I added it to a staging table with a code for what type of match found it. Then I assigned a confidence amount to each type of match and added up the confidence for each potential match. So if the SOC matches, you might want a high confidence, same if the name is eact and the gender is exact and the dob is exact. Less so if the last name is exact and the first name is not exact ,etc. By adding a confidence, I was much better able to see which possible mathes were more likely to be the same person. SQl Server also has a soundex function whic can help with names that are slightly different. I'll bet Oracle has something similar.

After I did this, I learned how to do fuzzy grouping in SSIS and was able to generate more matches with a higher confidence level. I don't know if Oracle's ETL tools havea a way to do fuzzy logic, but if they do it can really help with this type of task. If you happen to also have SQL Server, SSIS can be run connecting to Oracle, so you could use fuzzy grouping yourself. It can take a long time to run though.

I will warn you that name, dob and gender are not likely to ensure they are the same person especially for common names.

HLGEM
"I would probably use joins instead of correlated subquires" - hm. I'm sort of new to this, but I don't think we can join the two tables. There is a system requirement that the person ID _cannot_ link the two systems. The only other unique id that I think could be used is soc, but like I said, it is missing a lot."So first I found everyone where the names and city/state were an exact match." - Can you explain how you did this?"I will warn you..." This is exactly the problem we are having, due to bad data. The people requesting the report are aware of this.Thank you!
Julie
Oracle 8i+ supports SOUNDEX: http://techonthenet.com/oracle/functions/soundex.php
OMG Ponies
YOu don't join onthe personid, you join onthe fileds you are trying to match. Select * from table1 t1 join table2 t2 on t1.firstname = t2.firstname and t1.lastname = t2.lastname and t1.DOB = t2.DOB
HLGEM
+1  A: 

You definitely want to weigh the different matches. If an SSN matches, that's a pretty good indication. If a firstName matches, that's basically worthless.

You could try a scoring method based on weights for the matches, combined with the phonetic string matching algorithms you linked to. Here's an example I whipped up in T-SQL. It would have to be ported to Oracle for your issue.

--Score Threshold to be returned
DECLARE @Threshold DECIMAL(5,5) = 0.60

--Weights to apply to each column match (0.00 - 1.00)
DECLARE @Weight_FirstName DECIMAL(5,5) = 0.10
DECLARE @Weight_LastName DECIMAL(5,5) = 0.40
DECLARE @Weight_SSN DECIMAL(5,5) = 0.40
DECLARE @Weight_Gender DECIMAL(5,5) = 0.10

DECLARE @NewStuff TABLE (ID INT IDENTITY PRIMARY KEY, FirstName VARCHAR(MAX), LastName VARCHAR(MAX), SSN VARCHAR(11), Gender VARCHAR(1))
INSERT INTO @NewStuff
    ( FirstName, LastName, SSN, Gender )
VALUES  
    ( 'Ben','Sanders','234-62-3442','M' )

DECLARE @OldStuff TABLE (ID INT IDENTITY PRIMARY KEY, FirstName VARCHAR(MAX), LastName VARCHAR(MAX), SSN VARCHAR(11), Gender VARCHAR(1))
INSERT INTO @OldStuff
    ( FirstName, LastName, SSN, Gender )
VALUES
    ( 'Ben','Stickler','234-62-3442','M' ), --3/4 Match
    ( 'Albert','Sanders','523-42-3441','M' ), --2/4 Match
    ( 'Benne','Sanders','234-53-2334','F' ), --2/4 Match
    ( 'Ben','Sanders','234623442','M' ), --SSN has no dashes
    ( 'Ben','Sanders','234-62-3442','M' ) --perfect match

SELECT 
    'NewID' = ns.ID,
    'OldID' = os.ID,

    'Weighted Score' = 
        (CASE WHEN ns.FirstName = os.FirstName THEN @Weight_FirstName ELSE 0 END)
        +
        (CASE WHEN ns.LastName = os.LastName THEN @Weight_LastName ELSE 0 END)
        +
        (CASE WHEN ns.SSN = os.SSN THEN @Weight_SSN ELSE 0 END)
        +
        (CASE WHEN ns.Gender = os.Gender THEN @Weight_Gender ELSE 0 END)
    ,   

    'RAW Score' = CAST(
        ((CASE WHEN ns.FirstName = os.FirstName THEN 1 ELSE 0 END)
        +
        (CASE WHEN ns.LastName = os.LastName THEN 1 ELSE 0 END) 
        +
        (CASE WHEN ns.SSN = os.SSN THEN 1 ELSE 0 END) 
        +
        (CASE WHEN ns.Gender = os.Gender THEN 1 ELSE 0 END) ) AS varchar(MAX))
        + 
        ' / 4',

    os.FirstName ,
    os.LastName ,
    os.SSN ,
    os.Gender

FROM @NewStuff ns

--make sure that at least one item matches exactly
INNER JOIN @OldStuff os ON 
    os.FirstName = ns.FirstName OR
    os.LastName = ns.LastName OR
    os.SSN = ns.SSN OR
    os.Gender = ns.Gender
where 
    (CASE WHEN ns.FirstName = os.FirstName THEN @Weight_FirstName ELSE 0 END)
    +
    (CASE WHEN ns.LastName = os.LastName THEN @Weight_LastName ELSE 0 END)
    +
    (CASE WHEN ns.SSN = os.SSN THEN @Weight_SSN ELSE 0 END)
    +
    (CASE WHEN ns.Gender = os.Gender THEN @Weight_Gender ELSE 0 END)
    >= @Threshold
ORDER BY ns.ID, 'Weighted Score' DESC

And then, here's the output.

NewID OldID Weighted  Raw    First  Last      SSN          Gender
1     5     1.00000   4 / 4  Ben    Sanders   234-62-3442  M
1     1     0.60000   3 / 4  Ben    Stickler  234-62-3442  M
1     4     0.60000   3 / 4  Ben    Sanders   234623442    M

Then, you would have to do some post processing to evaluate the validity of each possible match. If you ever get a 1.00 for weighted score, you can assume that it's the right match, unless you get two of them. If you get a last name and SSN (a combined weight of 0.8 in my example), you can be reasonably certain that it's correct.

EndangeredMassa
That's SQL Server notation (possibly MySQL too), not Oracle...
OMG Ponies
It is, but I imagine it could be ported. I was posting this as a concept.
EndangeredMassa
Thanks, this looks like it will be very helpful. I'm not very familiar with PL/SQL, so it might take me some time to see if I will be able to do this or not. But I understand what's going on and it makes sense. Thanks!
Julie
+2  A: 

Example of HLGEM's JOIN suggestion:

SELECT a.lastname, 
       a.firstname, 
       a.soc, 
       a.dob, 
       a.gender
  FROM TABLE a
  JOIN TABLE b ON SOUNDEX(b.lastname) = SOUNDEX(a.lastname)
              AND SOUNDEX(b.firstname) = SOUNDEX(a.firstname)
              AND b.soc = a.soc
              AND b.dob = a.dob
              AND b.gender = a.gender

Reference: SOUNDEX

OMG Ponies
Thanks. Makes sense, but the data is too messy, nothing would be returned! Thanks for showing use of the soundex function too, I wasn't sure how it was actually used.
Julie
A: 

You can use soundex but you can also use utl_match for fuzzy comparing of string, utl_match makes it possible to define a treshold: http://www.psoug.org/reference/utl%5Fmatch.html

tuinstoel