Hi all,
I am currently working on a project where a data matching algorithm needs to be implemented. An external system passes in all the data it knows about a customer, and the system I am designing has to return the matching customer. That way the external system learns the correct ID of the customer, and it can also fetch additional data or update its own data for that customer.
The following fields are passed in:
- Name
- Name2
- Street
- City
- ZipCode
- BankAccountNumber
- BankName
- BankCode
- Phone
- Fax
- Web
Sometimes the data is of high quality and a lot of information is available, but often the data is poor: only the name and address are available, and they may contain spelling errors.
I'm implementing the project in .NET. What I currently do is something like the following:
    public bool IsMatch(Customer customer)
    {
        // CanIdentify... just checks that the info is provided and has a minimum length (e.g. > 1)
        if (CanIdentifyByStreet() && CanIdentifyByBankAccountNumber())
        {
            // some parsing of strings done before (substring, etc.)
            if (Street == customer.Street && AccountNumber == customer.BankAccountNumber) return true;
        }
        if (CanIdentifyByStreet() && CanIdentifyByZipCode() && CanIdentifyByName())
        {
            ...
        }
        // no combination matched
        return false;
    }
I am not very happy with this approach, because I would have to write an if statement for every reasonable combination of fields so that I don't miss any chance of matching the entity.
So I thought I could compute some kind of matching score instead: for each criterion that matches, points are added to the score. Like:
    public bool IsMatch(Customer customer)
    {
        int matchingScore = 0;
        if (CanIdentifyByStreet())
        {
            if(....)
                matchingScore += 10;
        }
        if (CanIdentifyByName())
        {
            if(....)
                matchingScore += 10;
        }
        if (CanIdentifyByBankAccountNumber())
        {
            if(....)
                matchingScore += 10;
        }
        // iDontKnow is the threshold I have no good value for yet
        return matchingScore > iDontKnow;
    }
This would allow me to take all matching data into consideration and, depending on a per-field weight, increase the matching score accordingly. If the score is high enough, it's a match.
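To make the idea more concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption on my part rather than working project code: the reduced Customer class, the weights, the 0.85 threshold and the minimum-evidence value are all made up for illustration, and I picked Levenshtein edit distance as one possible way to tolerate spelling errors. Fields that are missing on either side are skipped rather than counted as mismatches, and the score is the weighted average similarity over the fields that could be compared.

```csharp
using System;

// Hypothetical, reduced record type; field names follow the list of passed-in fields.
public class Customer
{
    public string Name, Street, ZipCode, BankAccountNumber;
}

public static class Matcher
{
    // Illustrative weights only (assumption): a bank account number is treated as
    // more identifying than a zip code. Real values would need tuning on known data.
    static readonly (Func<Customer, string> Field, double Weight)[] Rules =
    {
        (c => c.Name, 3.0),
        (c => c.Street, 2.0),
        (c => c.ZipCode, 1.0),
        (c => c.BankAccountNumber, 5.0),
    };

    // Levenshtein edit distance, computed with two rolling rows.
    public static int EditDistance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }

    // Similarity in [0, 1] after trimming and lower-casing; 1 means identical.
    public static double Similarity(string a, string b)
    {
        a = a.Trim().ToLowerInvariant();
        b = b.Trim().ToLowerInvariant();
        int max = Math.Max(a.Length, b.Length);
        return max == 0 ? 1.0 : 1.0 - (double)EditDistance(a, b) / max;
    }

    // Weighted average similarity over the fields both records provide.
    // 'evidence' receives the total weight of the fields that were compared.
    public static double Score(Customer incoming, Customer candidate, out double evidence)
    {
        double total = 0;
        evidence = 0;
        foreach (var (field, weight) in Rules)
        {
            string x = field(incoming), y = field(candidate);
            if (string.IsNullOrWhiteSpace(x) || string.IsNullOrWhiteSpace(y)) continue;
            total += weight * Similarity(x, y);
            evidence += weight;
        }
        return evidence == 0 ? 0 : total / evidence;
    }

    // A match needs a high enough score AND enough compared weight, so that a
    // single agreeing zip code is not accepted as a match on its own.
    public static bool IsMatch(Customer incoming, Customer candidate,
                               double threshold = 0.85, double minEvidence = 3.0)
        => Score(incoming, candidate, out double evidence) >= threshold
           && evidence >= minEvidence;
}
```

With this, Matcher.IsMatch(incoming, candidate) would be called once per candidate record. Averaging over the compared fields keeps sparse records from being penalised for missing data, which is why the separate minimum-evidence check is there. I still have no principled way to pick the weights or the threshold.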
Now my question is: are there any best practices for this kind of problem, such as established matching algorithm patterns? Thanks a lot!