views:

118

answers:

2

Hi all,

I am currently working on a project where I a data matching algorithm needs to be implemented. An external system passes in all data it knows about a customer, and the system I design has to return the customer matched. So the external system then knows the correct id of the customer plus it gets additional data or can update its own data of the specific customer.

The following fields are passed in:

  • Name
  • Name2
  • Street
  • City
  • ZipCode
  • BankAccountNumber
  • BankName
  • BankCode
  • Email
  • Phone
  • Fax
  • Web

The data can be of high quality and alot of information is available, but often the data is crappy and just the name and address is available and might have spellings.

I'm implementing the project in .Net. What I currently do is something like the following:

public bool IsMatch(Customer customer)
{
    // CanIdentify just checks if the info is provided and has a specific length (e.g. > 1)
    if (CanIdentifyByStreet() && CanIdentifyByBankAccountNumber())
    {
        // some parsing of strings done before (substring, etc.)
        if(Street == customer.Street && AccountNumber == customer.BankAccountNumber) return true;
    }
    if (CanIdentifyByStreet() && CanIdentifyByZipCode() &&CanIdentifyByName())
    {
        ...
    }
}

I am not very happy with the approach above. This is because I would have to write if statements for all reasonable cases (combinations) so I don't miss any chance of matching the entity.

So I thought maybe I could create some kind of matching score. So for each criteria matched, a score would be added. Like:

public bool IsMatch(Customer customer)
{
    int matchingScore = 0;
    if (CanIdentifyByStreet())
    {
        if(....)
            matchingScore += 10;
    }
    if (CanIdentifyByName())
    {
        if(....)
            matchingScore += 10;
    }
    if (CanIdentifyBankAccountNumber())
    {
        if(....)
            matchingScore += 10;
    }

    if(matchingScore > iDontKnow)
        return true;
}

This would allow me to take in consideration all matching data, and depending on some weight I would increase the matching score. If the score is high enough, it's a match.

Know my question is: Are there any best practices out there for such things, like matching algorithm patterns etc? Thanks alot!

+1  A: 

For inspiration, look at the Levenshtein distance algorithm. This will give you a reasonable mechanism to weight your comparisons.

I would also add that in my experience you can never match two arbitrary pieces of data into the same entity with absolute certainty. You need to present plausible matches to a user, who can then verify for sure that John Smith on 1920 E. Pine is the same person as Jon Smith on 192 East Pine Road or not.

Matthew Vines
Levenshtein and hamming would be overkill.
Rook
Maybe, I'm not sure exactly what his requirements are. The matchingScore variable in his proposed solution led me to believe that he required some weighting system for matches, and was unsure how to proceed.
Matthew Vines
@Matthew Vines +1; you could then store the difference of the n-th variable into the n-th element of a vector and then compute the euclidean distance of two domain objects to get the match (I guess you will need to multiple some constants before each variable to weight them)@The Rook why? look at the wikipedia pseudo code.
Karussell
A: 

In my experience with this sort of thing, it was actually the business people who defined the rules of what was acceptible as a match, rather than it being a technical decision. This has made sense to me, since the business ends up assuming the risk. Also, what constitutes a match can be prone to change, like if they use the system and find that too many people are being excluded.

I think that your first approach makes more sense, in that if you can match someone by name and bank account number, then you're pretty sure it's them. However, if both the name and bank info don't match, but the address, phone, and all that matches (ie. a spouse) then the scoring system might incorrectly match people. I realize it's a lot of code, but so long as you extract out the actual matching code (matchPhoneNumber method, etc), then it's fine design-wise.

I would probably take it a step further and pull out the matching into an enum and then have lists of acceptable matches. Sort of like this: interface Match { boolean matches(Customer c1, Customer c2); }

class BankAccountMatch implements Match
{
    public boolean matches(Customer c1, Customer c2)
    {
        return c1.getBankAccountNumber() == c2.getBankAccountNumber();
    }
}

static Match BANK_ACCOUNT_MATCH = new BankAccountMatch();

Match[][] validMatches = new Match[] [] {
        {BANK_ACCOUNT_MATCH, NAME_MATCH},
        {NAME_MATCH, ADDRESS_MATCH, FAX_MATCH}, ...
};

And then the code that does the validation would just iterate over the validMatches array and test them to see if one fits. I might even pull out the lists of valid matches into a config file. That all depends on the level of robustness your system needs though.

Amber Shah