ansaurus

Question

Answer 1

+1 A:

For inspiration, look at the Levenshtein distance algorithm. This will give you a reasonable mechanism to weight your comparisons.

I would also add that, in my experience, you can never match two arbitrary pieces of data to the same entity with absolute certainty. You need to present plausible matches to a user, who can then verify whether John Smith at 1920 E. Pine is the same person as Jon Smith at 192 East Pine Road.
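As a rough illustration of the Levenshtein suggestion, here is a minimal sketch of the standard two-row dynamic-programming version. The class name `Levenshtein` and the `similarity` normalisation are my own assumptions for the example, not something from the question.

```java
final class Levenshtein {
    private Levenshtein() {}

    /** Number of single-character insertions, deletions or substitutions needed to turn a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Hypothetical normalisation: 1.0 for identical strings, lower the more edits are needed. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("John Smith", "Jon Smith")); // prints 1
        System.out.println(similarity("1920 E. Pine", "192 East Pine Road"));
    }
}
```

A score like this only ranks candidates; the threshold at which a pair is shown to the user for verification is still something you have to pick.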

Matthew Vines 2010-03-12 19:37:18

Levenshtein and Hamming would be overkill.

Rook 2010-03-12 19:55:54

Maybe, I'm not sure exactly what his requirements are. The matchingScore variable in his proposed solution led me to believe that he required some weighting system for matches, and was unsure how to proceed.

Matthew Vines 2010-03-12 20:23:11

@Matthew Vines +1. You could then store the difference of the n-th variable in the n-th element of a vector and compute the Euclidean distance between two domain objects to get the match score (I guess you will need to multiply each variable by a constant to weight it). @The Rook Why? Look at the Wikipedia pseudocode.
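A minimal sketch of that idea, assuming each field comparison has already been reduced to a numeric difference (for example an edit distance). The class name `WeightedDistance` and the weight values are made up for illustration.

```java
final class WeightedDistance {
    private WeightedDistance() {}

    /**
     * Euclidean length of the per-field difference vector after each
     * difference has been multiplied by its weight. 0.0 means identical.
     */
    static double distance(double[] differences, double[] weights) {
        if (differences.length != weights.length) {
            throw new IllegalArgumentException("one weight per difference is required");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < differences.length; i++) {
            double weighted = weights[i] * differences[i];
            sumOfSquares += weighted * weighted;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Hypothetical differences for name, address and phone; the name counts double.
        double[] differences = {1.0, 6.0, 0.0};
        double[] weights = {2.0, 1.0, 1.0};
        System.out.println(distance(differences, weights));
    }
}
```

A smaller result means a closer match, so candidates would be sorted ascending.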

Karussell 2010-03-29 08:57:35

Answer 2

A:

In my experience with this sort of thing, it was actually the business people who defined the rules of what was acceptable as a match, rather than it being a technical decision. This has made sense to me, since the business ends up assuming the risk. Also, what constitutes a match can be prone to change, for example if they use the system and find that too many people are being excluded.

I think that your first approach makes more sense, in that if you can match someone by name and bank account number, then you're pretty sure it's them. However, if neither the name nor the bank info matches, but the address, phone, and everything else do (i.e. a spouse), then the scoring system might incorrectly match people. I realize it's a lot of code, but so long as you extract out the actual matching code (matchPhoneNumber method, etc.), then it's fine design-wise.

I would probably take it a step further and pull out the matching into an enum and then have lists of acceptable matches. Sort of like this:

interface Match
{
    boolean matches(Customer c1, Customer c2);
}

class BankAccountMatch implements Match
{
    public boolean matches(Customer c1, Customer c2)
    {
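        // Compare by value: == would compare references if the number is a String.
        return java.util.Objects.equals(c1.getBankAccountNumber(), c2.getBankAccountNumber());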
    }
}

static Match BANK_ACCOUNT_MATCH = new BankAccountMatch();

Match[][] validMatches = new Match[][] {
        {BANK_ACCOUNT_MATCH, NAME_MATCH},
        {NAME_MATCH, ADDRESS_MATCH, FAX_MATCH}, // ...
};

And then the code that does the validation would just iterate over the validMatches array and test them to see if one fits. I might even pull out the lists of valid matches into a config file. That all depends on the level of robustness your system needs though.
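The validation loop could look roughly like this. It is only a sketch: the `Customer` fields, the lambda-based rules, and the `isSameCustomer` name are assumptions I made so the example compiles on its own.

```java
final class RuleMatcher {
    private RuleMatcher() {}

    /** Hypothetical stand-in for the real Customer class. */
    static final class Customer {
        final String name;
        final String bankAccountNumber;
        final String address;

        Customer(String name, String bankAccountNumber, String address) {
            this.name = name;
            this.bankAccountNumber = bankAccountNumber;
            this.address = address;
        }
    }

    interface Match {
        boolean matches(Customer c1, Customer c2);
    }

    /** Two missing values are not treated as a match. */
    private static boolean same(String a, String b) {
        return a != null && a.equals(b);
    }

    static final Match NAME_MATCH = (c1, c2) -> same(c1.name, c2.name);
    static final Match BANK_ACCOUNT_MATCH = (c1, c2) -> same(c1.bankAccountNumber, c2.bankAccountNumber);
    static final Match ADDRESS_MATCH = (c1, c2) -> same(c1.address, c2.address);

    /** Each row is one acceptable combination; every rule in a row must hold. */
    static final Match[][] VALID_MATCHES = {
            {BANK_ACCOUNT_MATCH, NAME_MATCH},
            {NAME_MATCH, ADDRESS_MATCH},
    };

    /** True as soon as one combination is fully satisfied. */
    static boolean isSameCustomer(Customer c1, Customer c2, Match[][] validMatches) {
        for (Match[] combination : validMatches) {
            boolean allRulesHold = true;
            for (Match rule : combination) {
                if (!rule.matches(c1, c2)) {
                    allRulesHold = false;
                    break;
                }
            }
            if (allRulesHold) {
                return true;
            }
        }
        return false;
    }
}
```

With these rows, a spouse who shares only the address is rejected, because no combination accepts an address match on its own.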

Amber Shah 2010-03-12 19:51:57


Data matching algorithm
