views:

2317

answers:

7

I am doing a CSV Import tool for the project I'm working on. The client needs to be able to enter the data in excel, export them as CSV and upload them to the database. For example I have this CSV record:

   1,   John Doe,     ACME Comapny   (the typo is on purpose)

Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting. I plan to do this by comparing the company names in the database with the company names in the CSV. the comparison should return 0 if the strings are exactly the same, and return some value that gets bigger as the strings get more different, but strcmp doesn't cut it here because:

"Acme Company" and "Acme Comapny" should have a very small difference index, but "Acme Company" and "Cmea Mpnyaco" should have a very big difference index Or "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different. Also, "Acme Company" and "Company Acme" should return 0.

So if the client makes a type while entering data, i could prompt him to choose the name he most probably wanted to insert.

Is there a known algorithm to do this, or maybe we can invent one :) ?

+7  A: 

You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.

This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.

MattK
you beat me to it :)
Neil Aitken
That was very useful. I see PHP even has a levenshtein() function. thanks.
Discodancer
I found a levensthein function for mySQL as well, a quick google should find it.
Neil Aitken
+2  A: 

I've had some success with the Levenshtein Distance algorithm, there is also Soundex.

What language are you implementing this in? we may be able to point to specific examples

Neil Aitken
A: 

There's multiple algorithms to do just that, and most databases even include one by default. It is actually a quite common concern.

If its just about English words, SQL Server for example includes SOUNDEX which can be used to compare on the resulting sound of the word.

http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx

Loki
+4  A: 

I don't know what language you're coding in, but if it's PHP, you should consider the following algorithms:

levenshtein(): Returns the minimal number of characters you have to replace, insert or delete to transform one string into another.
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
similar_text(): Similar to levenshtein(), but it can return a percent value instead.

Phantom Watson
I've checked all of these functions you recommended, but I find only levenshtein useful, as I don't compare for how they sound, rather for typing errors and abbreviations.similar_text() sounded promising, but it return 58% on similar_text('Acme Company', 'Company Acme'), and I need 100% here :)
Discodancer
I might be getting ridiculous here, and this would be computationally slow, but you could use levenshtein() to compare each word from one query to each word in the other, counting only the closest matches as being the "intended word".
Phantom Watson
Oh, that's basically what you're already doing. Nevermind. :-J
Phantom Watson
Soundex is more useful on names
Hank
+2  A: 

I have actually implemented a similar system. I used the Levenshtein distance (as other posters already suggested), with some modifications. The problem with unmodified edit distance (applied to whole strings) is that it is sensitive to word reordering, so "Acme Digital Incorporated World Company" will match poorly against "Digital Incorporated World Company Acme" and such reorderings were quite common in my data.

I modified it so that if the edit distance of whole strings was too big, the algorithm fell back to matching words against each other to find a good word-to-word match (quadratic cost, but there was a cutoff if there were too many words, so it worked OK).

Rafał Dowgird
I like this approach. Very clever.
Phantom Watson
A: 

I'm implementing it in PHP, and I am now writing a piece of code that will break up 2 strings in words and compare each of the words from the first string with the words of the second string using levenshtein and accept the lowes possible values. Ill post it when I'm done.

Thanks a lot.

Update: Here's what I've come up with:

function myLevenshtein( $str1, $str2 )
{
  // prepare the words
  $words1 = explode( " ",  preg_replace( "/\s+/", " ", trim($str1) ) );
  $words2 = explode( " ",  preg_replace( "/\s+/", " ", trim($str2) ) );

  $found = array(); // array that keeps the best matched words so we don't check them again
  $score = 0;       // total score
  // In my case, strings that have different amount of words can be good matches too
  // For example, Acme Company and International Acme Company Ltd. are the same thing
  // I will just add the wordcount differencre to the total score, and weigh it more later if needed
  $wordDiff = count( $words1 ) - count( $words2 );
  foreach( $words1 as $word1 )
  {
    $minlevWord = "";
    $minlev = 1000;
    $return = 0;
    foreach( $words2 as $word2 )
    {
      $return = 1;
      if( in_array( $word2, $found ) )
        continue;
      $lev = levenshtein( $word1, $word2 );
      if( $lev < $minlev )
      {
        $minlev = $lev;
        $minlevWord = $word2;
      }
    }
    if( !$return )
      break;
    $score += $minlev;
    array_push( $found, $minlevWord );
  }

  return $score + $wordDiff;
}
Discodancer
+1  A: 

I've taken SoundEx, Levenshtein, PHP similarity, and double metaphone and packaged them up in C# in one set of extension methods on String.

Entire blog post here.

plinth