ansaurus

Question

PHP - smart, error tolerating string comparison

Answer 1

+1 A:

Spelling checkers perform a kind of fuzzy string comparison. Perhaps you can adapt one of their algorithms, or borrow the spell-checker suggestion code from an open source project such as Firefox.

wallyk 2010-02-17 23:26:06

Thank you, but @Pascal MARTIN pointed me in a better direction :]

Adam Kiss 2010-02-18 09:21:49

Answer 2

+1 A:

Not sure (especially about the accents and special characters, which you might have to deal with first), but for characters that are missing or in the wrong place, the levenshtein function, which calculates the Levenshtein distance between two strings, might help you (quoting the manual):

int levenshtein  ( string $str1  , string $str2  )
int levenshtein  ( string $str1  , string $str2  , int $cost_ins  , int $cost_rep  , int $cost_del  )

The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2.
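To make the definition concrete, here is a minimal sketch that measures the distance between a known name and a few misspellings. The helper name `withinEditDistance` and the cutoff of 2 edits are assumptions for illustration, not part of the PHP manual. Note that levenshtein() compares bytes, so the inputs here are assumed to be already reduced to plain ASCII.

```php
<?php
// Hypothetical helper: true when $input is at most $maxEdits single-character
// edits (insert, replace, delete) away from $known.
function withinEditDistance(string $input, string $known, int $maxEdits = 2): bool
{
    return levenshtein($input, $known) <= $maxEdits;
}

$known = 'CAKANKA';
foreach (['CAKANKA', 'CAAKNKA', 'CAKAKA', 'NOVAK'] as $candidate) {
    // Swapped neighbours count as two replacements; a missing letter counts as one edit.
    printf("%s -> distance %d\n", $candidate, levenshtein($candidate, $known));
}
```

A fixed cutoff is the simplest choice; for names of very different lengths it may make more sense to scale the allowed distance with the string length.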

Other possibly useful functions could be soundex, similar_text, or metaphone.

The user notes on the manual pages of those functions, especially the one for levenshtein, might contain something useful too ;-)

Pascal MARTIN 2010-02-17 23:26:56

Accents are not the problem: the first thing I'll do is `uppercase` the string and then replace accented characters with their non-accented versions (`ž`=>`z`)

Adam Kiss 2010-02-18 09:17:04
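The clean-up step described in this comment could look roughly like the sketch below. The function name `normalizeName` and the character map are assumptions, and the map is deliberately incomplete. The sketch replaces accented characters first and upper-cases afterwards, so that plain strtoupper() only ever has to deal with ASCII letters; with the mbstring extension the order from the comment (mb_strtoupper() first) would work as well.

```php
<?php
// Hypothetical helper: fold a name to plain upper-case ASCII.
function normalizeName(string $name): string
{
    // Illustrative, incomplete map; extend it for the alphabet you expect.
    $map = [
        'á' => 'a', 'Á' => 'A', 'č' => 'c', 'Č' => 'C',
        'š' => 's', 'Š' => 'S', 'ž' => 'z', 'Ž' => 'Z',
    ];
    // strtr() with an array replaces whole multi-byte sequences,
    // so UTF-8 encoded keys are handled correctly.
    $ascii = strtr(trim($name), $map);

    // Only ASCII letters are left for the mapped characters, so strtoupper() is enough.
    return strtoupper($ascii);
}

echo normalizeName('Čakánka'), "\n";
```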

I'll check them out; one of the functions will be helpful, I'm 100% sure.

Adam Kiss 2010-02-18 09:21:19

Out of curiosity, when you say "one of the functions", which one are you actually thinking of? The levenshtein one, or another one?

Pascal MARTIN 2010-02-18 11:23:47

I'll probably go with `similar_text`. I need to check a name (`<40` characters) against one in the database, so efficiency is not really a concern. And `similar_text` returns a similarity percentage, so I can basically say that if the cleaned names match at `85%+` or so, they are the same :)

Adam Kiss 2010-02-18 11:39:10
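A minimal sketch of that threshold check, assuming both names have already been cleaned (upper-cased, accents removed). The helper name `isSameName` is hypothetical, and the default of 85 is simply the figure mentioned in the comment above, not a tuned value.

```php
<?php
// Hypothetical helper: treat two cleaned names as the same when similar_text()
// reports at least $threshold percent similarity.
function isSameName(string $a, string $b, float $threshold = 85.0): bool
{
    // The third argument is passed by reference and receives the percentage.
    similar_text($a, $b, $percent);

    return $percent >= $threshold;
}

var_dump(isSameName('CAKANKA', 'CAKANKA')); // identical strings
var_dump(isSameName('CAKANKA', 'CAKAKA'));  // one letter missing
var_dump(isSameName('CAKANKA', 'NOVAK'));   // a different name
```

The right threshold depends on the data: short names change their percentage a lot with a single typo, so it is worth trying the cutoff against real examples before relying on it.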

OK. Thanks for the information :-)

Pascal MARTIN 2010-02-18 11:50:00

Answer 3

+2 A:

You could transliterate the words to Latin characters and use a phonetic algorithm like Soundex to get the essence of each word and compare it to the ones you have. In your case that would be C252 for all of your words except the last one, which is C250.
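As a quick illustration of how such codes come out, the sketch below prints the Soundex code of a few spellings that are assumed to be already transliterated to ASCII. The word list is only an example.

```php
<?php
// Print the Soundex code of a few already-transliterated spellings.
// soundex() keeps the first letter and encodes the following consonants as digits.
foreach (['CAKANKA', 'CAAKNKA', 'CKAANKA', 'CAKAKA'] as $word) {
    echo $word, ' => ', soundex($word), "\n";
}
```

CAKANKA and CAAKNKA share a code, so they would be grouped together. CKAANKA gets a different code, because the K directly after the initial C has the same Soundex digit and is dropped. Typos close to the start of a word can therefore end up with a different code, which is worth keeping in mind when using the code as a lookup key.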

Edit: The problem with comparative functions like levenshtein or similar_text is that you need to call them for every pair of input value and candidate value. That means that if you have a database with 1 million entries, you will need to call these functions 1 million times for each input.

But functions like soundex or metaphone, which calculate a kind of digest, can help to reduce the number of actual comparisons. If you store the soundex or metaphone value for each known word in your database, you can narrow down the set of possible matches very quickly. Once that set has been reduced, you can use the comparative functions to find the best match.

Here’s an example:

// building the index that represents your database
$knownWords = array('Čakánka', 'Cakaka');
$index = array();
foreach ($knownWords as $key => $word) {
    // Note: the result of //TRANSLIT depends on the iconv implementation and the current locale.
    $code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
    if (!isset($index[$code])) {
        $index[$code] = array();
    }
    $index[$code][] = $key;
}

// test words
$testWords = array('cakanka', 'cákanká', 'ČaKaNKA', 'CAKANKA', 'CAAKNKA', 'CKAANKA', 'cakakNa');
echo '<ul>';
foreach ($testWords as $word) {
    $code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
    if (isset($index[$code])) {
        echo '<li> '.$word.' is similar to: ';
        $matches = array();
        foreach ($index[$code] as $key) {
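            // mb_strtolower() (mbstring extension) also lower-cases accented letters such as Č; strtolower() may not.
            similar_text(mb_strtolower($word, 'UTF-8'), mb_strtolower($knownWords[$key], 'UTF-8'), $percentage);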
            $matches[$knownWords[$key]] = $percentage;
        }
        arsort($matches);
        echo '<ul>';
        foreach ($matches as $match => $percentage) {
            echo '<li>'.$match.' ('.$percentage.'%)</li>';
        }
        echo '</ul></li>';
    } else {
        echo '<li>no match found for '.$word.'</li>';
    }
}
echo '</ul>';

Gumbo 2010-02-17 23:27:26

This is very interesting, but maybe too vague for my needs. Thank you though.

Adam Kiss 2010-02-18 09:17:41
