I receive data files from a source I have no control over (the government), and the records include a Company Name field that I need to match against existing company records in my database. I'm concerned that some names will differ in minor ways, such as 'Company X, Inc.' vs 'Company X Inc'.

So my initial thought is to create a collation key field based on the name: apply ToLower(), then use a regex to strip out all whitespace and special characters.
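To make the idea concrete, here is a minimal sketch of such a key in Python (the question implies .NET, so treat this as an illustration only). The function name `collation_key` is hypothetical, and the choice to keep letters and digits while dropping everything else is an assumption about what "special characters" means.

```python
import re


def collation_key(name: str) -> str:
    """Build a comparison key: lowercase, then drop non-alphanumerics.

    [\\W_] matches anything that is not a letter or digit (the underscore
    is listed explicitly because \\W treats it as a word character).
    """
    return re.sub(r"[\W_]+", "", name.lower())
```

With this sketch, 'Company X, Inc.' and 'Company X Inc' both reduce to the same key, but so would any two names that differ only in punctuation or spacing, whether or not they are the same company.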

Is there a better approach than this?

+1  A: 

That may work, but an algorithm-only solution can produce false matches, and you'd have no way to prevent them. Your best bet is to create an alias table: include every variation ever found for each company name, with a FK to the real company's ID. Include a row for the actual name as well.

AliasID CompanyID CompanyAlias
------- --------- ------------
1       1         Company X, Inc   <<--actual real company name
2       1         Company X Inc
3       1         Company X

If an exact name match is not found in this table when importing data, you can use your proposed algorithm (or another one), human review, etc. to find a match or create a new company. At that point, insert the new variation into the alias table. If you later find that a match was wrong, you can alter the alias table to fix the mapping. If you rely on an algorithm alone, you'd need to build in exceptions, and the algorithm would grow large and slow. With this table and a good index, finding your matches should be fast.
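The import flow described above could be sketched roughly as follows, using Python's built-in `sqlite3` purely for illustration. The `Company` table, the `UNIQUE` constraint on `CompanyAlias` (which gives the exact-match lookup an index), and the helper names `create_schema`, `collation_key`, and `resolve_company` are all assumptions, not part of the original answer. The fallback here is the normalized-key comparison from the question; a human review step could replace it.

```python
import re
import sqlite3


def collation_key(name):
    # Hypothetical fallback: lowercase and strip non-alphanumerics.
    return re.sub(r"[\W_]+", "", name.lower())


def create_schema(conn):
    # Assumed schema mirroring the AliasID / CompanyID / CompanyAlias layout.
    conn.executescript("""
        CREATE TABLE Company (
            CompanyID INTEGER PRIMARY KEY,
            Name      TEXT NOT NULL
        );
        CREATE TABLE CompanyAlias (
            AliasID      INTEGER PRIMARY KEY,
            CompanyID    INTEGER NOT NULL REFERENCES Company(CompanyID),
            CompanyAlias TEXT NOT NULL UNIQUE  -- UNIQUE also creates an index
        );
    """)


def resolve_company(conn, incoming_name):
    """Return the CompanyID for incoming_name, or None if nothing matches."""
    # Step 1: exact match against known aliases (indexed lookup).
    row = conn.execute(
        "SELECT CompanyID FROM CompanyAlias WHERE CompanyAlias = ?",
        (incoming_name,),
    ).fetchone()
    if row is not None:
        return row[0]

    # Step 2: fall back to comparing normalized keys.
    key = collation_key(incoming_name)
    aliases = conn.execute(
        "SELECT CompanyID, CompanyAlias FROM CompanyAlias"
    ).fetchall()
    for company_id, alias in aliases:
        if collation_key(alias) == key:
            # Remember the new variation so the next import hits step 1.
            conn.execute(
                "INSERT INTO CompanyAlias (CompanyID, CompanyAlias) "
                "VALUES (?, ?)",
                (company_id, incoming_name),
            )
            return company_id

    # No match: the caller decides whether to create a company or ask a human.
    return None
```

Because every resolved variation is stored as its own row, a wrong match can be corrected later by updating or deleting that one alias row rather than by changing the matching code.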

KM
Chris Marisic
