I have two sources of information for the same data (companies), which I can join via a unique ID (contract number). The two sources exist, and differ, because each is updated manually and independently. So what I have is an ID and a company name in two tables.

I need to come up with an algorithm that compares the names in the two tables for the same ID, and orders all the companies by a value indicating how different the two strings are (so the most different pairs appear at the top of the list).

I looked at the simple Levenshtein distance algorithm, but it works at the letter level, so I am still looking for something better.

The reason Levenshtein doesn't really do the job is this: company names are prefixed or postfixed with the organizational form (LTD, JSC, Co., etc.). So there are many pairs like JSC "Foo" and Foo JSC. that differ a lot by edit distance even though they name the same company, while what I am really looking for in the database is pairs of genuinely different strings like SomeLongCompanyName JSC and JSC OtherName.
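To make the problem concrete, here is a minimal Python sketch (not the asker's code) showing why plain character-level Levenshtein misranks these pairs: a reordered name of the same company can score worse than a genuinely different name. The example company names are taken from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# The same company with a reordered legal form scores *worse*
# than two genuinely different (but similarly spelled) names:
print(levenshtein('JSC "Foo"', 'Foo JSC'))  # large distance, same company
print(levenshtein('JSC Foo', 'JSC Foz'))    # tiny distance, different company
```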

Are there any good ways to do this? (I don't really like the idea of using a regex to split each string into words and then matching every word against the other string by Levenshtein distance, so I am looking for other ideas.)

+1  A: 

Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
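The two suggestions above can be sketched as a single normalisation step; this is an illustrative Python version, and the stop-word list of legal forms is an example only, not from the answer:

```python
# Hypothetical list of organizational-form "stop words" to drop.
LEGAL_FORMS = {"jsc", "ltd", "llc", "co", "inc", "corp"}

def normalise(name: str) -> str:
    """Drop legal-form words, then sort the rest so word order is ignored."""
    cleaned = name.replace('"', " ").replace(".", " ").replace(",", " ")
    words = cleaned.lower().split()
    kept = sorted(w for w in words if w not in LEGAL_FORMS)
    return " ".join(kept)

# Both spellings of the same company collapse to the same key:
print(normalise('JSC "Foo"'))  # -> foo
print(normalise("Foo JSC."))   # -> foo
```

Comparing (or Levenshtein-scoring) the normalised forms instead of the raw strings makes reordered legal forms irrelevant.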

As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
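For reference, a compact sketch of American Soundex (the standard algorithm, not code from the answer). It maps a word to a letter plus three digits, which makes it usable as an index key, but as noted below it also maps quite different words to the same code:

```python
def soundex(word: str) -> str:
    """American Soundex: first letter + up to three consonant-class digits."""
    codes = {"bfpv": "1", "cgjksxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}

    def code(c: str) -> str:
        for letters, digit in codes.items():
            if c in letters:
                return digit
        return ""  # vowels, h, w, y carry no digit

    word = word.lower()
    result = word[0].upper()
    prev = code(word[0])
    for c in word[1:]:
        d = code(c)
        if d and d != prev:
            result += d
        if c not in "hw":  # h and w do not separate equal codes
            prev = d
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- indexable match
print(soundex("Foo"), soundex("Fu"))         # F000 F000 -- false positive
```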

Thomas Mueller
The common words are also significant: `JSC` differs from `LTD`, and the organizational form may change, although rarely. As for Soundex, it can mark two entirely different words as equal. Sorting the words is possible, though expensive.
Alexander
A: 

Thank you both for the ideas. I used four indices, each a Levenshtein distance divided by the sum of the lengths of both strings (a relative distance), computed over the following:

  • Just the 2 strings
  • The string obtained by splitting into word sequences, eliminating the non-word characters, sorting the words in ascending order and joining them with a space separator.
  • The string which is contained between quotes (if no such string is present, the original string is taken)
  • The string composed of alphabetically ordered first characters of each word.

Each of these is in turn an integer value between 1 and 1000. The resulting value is the product:
X1^E1 * X2^E2 * X3^E3 * X4^E4
where X1..X4 are the indices, and E1..E4 are user-provided preferences expressing how significant each index is. To keep the result within the range 1..1000, the vector (E1..E4) is normalized.
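The scheme above can be sketched in Python as follows. This is an illustrative re-implementation, not the author's C# CLR code: the four views and the weighted product follow the description, while details such as rounding and the clamping of each index to at least 1 are assumptions. Normalizing the exponents so they sum to 1 makes the product a weighted geometric mean, which is what keeps it within 1..1000.

```python
import re

def lev(a: str, b: str) -> int:
    """Plain Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def views(s: str) -> list:
    """The four views of a name described above."""
    words = re.findall(r"\w+", s)
    quoted = re.search(r'"([^"]+)"', s)
    return [
        s,                                              # 1: the raw string
        " ".join(sorted(w.lower() for w in words)),     # 2: sorted words
        quoted.group(1) if quoted else s,               # 3: quoted part, if any
        "".join(sorted(w[0].lower() for w in words)),   # 4: sorted initials
    ]

def difference(a: str, b: str, e=(1, 1, 1, 1)) -> float:
    """Product of the four relative distances, with normalized exponents."""
    total = sum(e)
    score = 1.0
    for va, vb, ei in zip(views(a), views(b), e):
        x = max(1, round(1000 * lev(va, vb) / max(1, len(va) + len(vb))))
        score *= x ** (ei / total)  # normalized exponents keep score in 1..1000
    return score

print(difference('JSC "Foo"', "Foo JSC."))                       # small
print(difference("SomeLongCompanyName JSC", 'JSC "OtherName"'))  # large
```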

The results are impressive. The whole thing works much faster than I expected (I built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) over non-null values in the whole database is 765. Down to about 300 there are virtually no matching company names. Around 200 there are companies with somewhat similar names, and some with the same name written in very different ways, with abbreviations, additional words, etc. At 100 and below, practically all the records contain names that are the same but written with slight differences, and by 30 only the word order or the punctuation may differ.
It totally works; the result is better than I expected.

I wrote a post on my blog, to share this library in case someone else needs it.

Alexander
+1  A: 

How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
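The four steps above can be sketched directly in Python (an illustrative version, not code from the answer; the 4-character cutoff is the one proposed in step 3):

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def compare(a: str, b: str) -> int:
    def prep(s: str) -> str:
        # 1. punctuation -> whitespace, 2. split into words
        words = re.sub(r"[^\w\s]", " ", s.lower()).split()
        # 3. short words (likely legal forms: JSC, LTD, Co) to the end, sorted
        long_words = [w for w in words if len(w) > 4]
        short_words = sorted(w for w in words if len(w) <= 4)
        return " ".join(long_words + short_words)
    # 4. Levenshtein on the rearranged strings
    return levenshtein(prep(a), prep(b))

print(compare('JSC "Foo"', "Foo JSC."))  # 0 -- reordered forms now match
```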

TonyK
Your help led to the solution, so I might as well mark this as the correct answer. Those who want the full details and code, see my answer (I'll update it shortly with the full details). Thanks.
Alexander