views:

433

answers:

3

I'm wondering if there is some kind of way to do fuzzy string matching in PHP. Looking for a word in a long string, finding a potential match even if its mis-spelled; something that would find it if it was off by one character due to an OCR error.

I was thinking a regex generator might be able to do it. So given an input of "crazy" it would generate this regex:

.*((crazy)|(.+razy)|(c.+azy)|cr.+zy)|(cra.+y)|(craz.+)).*

It would then return all matches for that word or variations of that word.

How to build the generator: I would probably split the search string/word up into an array of characters and build the regex expression doing a foreach the newly created array replacing the key value (the position of the letter in the string) with ".+".

Is this a good way to do fuzzy text search or is there a better way? What about some kind of string comparison that gives me a score based on how close it is? I'm trying to see if some badly converted OCR text contains a word in short.

A: 

Levenshtein is one example of a String Edit-distance. There are different metrics for different purposes. Familiarize yourself with them and find the one that works for you.

Visionary Software Solutions
A: 
echo generateRegex("crazy");
function generateRegex($word)
{
  $len = strlen($word);
  $regex = "\b((".$word.")";
  for($i = 0; $i < $len; $i++)
  {
    $temp = $word;
    $temp[i] = '.';
    $regex .= "|(".$temp.")";
  }
  $regex = $regex.")\b";
  return $regex;
}
Amarghosh
OP asked for a PHP-based solution.
Ben Dunlap
Agreed. But how hard it is to translate this?
Amarghosh
translated to php - click on the viewsource link in the diff to see java version.
Amarghosh
+3  A: 

String distance functions are useless when you don't know what the right word is. I'd suggest pspell functions:

$p = pspell_new("en");
print_r(pspell_suggest($p, "crazzy"));

http://www.php.net/manual/en/function.pspell-suggest.php

stereofrog
Ah, it shows that I'm a PHP novice: I never heard of that method! Excellent suggestion, this should be voted up. +1
Bart Kiers