tags:

views:

2549

answers:

5
A: 

This can be very complicated, and I am not personally aware of any good 3rd party libraries although I'm sure they exist. Others may be able to suggest some canned solutions, though.

I have written something similar from scratch a few times in the past. If you go down that route, it is probably not something you'd want to do in PHP by itself as every query would involve getting all of the records and performing your calculations on them. It will almost certainly involve creating a set of index tables that meet your specifications.

For instance, you would have to come up with rules for how you imagine that "Milwaukee" could end up spelled "milwakee." My solution to this was to do vowel compression and duplication compression (not sure if these are actually search terms). So, milwaukee would be indexed as:

  • milwaukee
  • m_lw__k__
  • m_lw_k_

When the search query came in for "milwaukee", I would run the same process on the text input, and then run a search on the index table for:

SELECT cityId,
       COUNT(*)
  FROM myCityIndexTable
 WHERE term IN ('milwaukee', 'm_lw__k__', 'm_lw_k_')

When the search query came in for "milwakee", I would run the same process on the text input, and then run a search on the index table for:

SELECT cityId,
       COUNT(*)
  FROM myCityIndexTable
 WHERE term IN ('milwaukee', 'm_lw_k__', 'm_lw_k_')

In the case of Milwaukee (spelled correctly), it would return "3" for the count.

In the case of Milwakee (spelled incorrectly) ,it would return "2" for the count (since it would not match the m_lw__k__ pattern as it only had one vowel in the middle).

If you sort the results based on the count, you would end up meeting one of your rules, that "Milwaukee" would end up being sorted higher as a possible match than "Milwakee."

If you want to build this system in a generic way (as hinted by your use of $table in the query) then you'd probably need another mapping table somewhere in there to map your terms to the appropriate table.

I'm not suggesting this is the best (or even a good) way to go about this, just something I've done in the past that might prove useful to you if you plan to try and do this without a third party solution.

Beau Simensen
+2  A: 

You should check out full text searching in MySQL. Also check out Zend's port of the Apache Lucene project, Zend_Search_Lucene.

Brian Fisher
+1  A: 

More searching led me to the Levenshtein distance and then to similar_text, which proved to be the best way to do this.

similar_text("input string", "match against this", $pct_accuracy);

compares the strings and then saves the accuracy as a variable. The Levenshtein distance determines how many delete, insert, or replace functions on a single character it would need to do to get from one string to the other, with an allowance for weighting each function differently (eg. you can make it cost more to replace a character than to delete a character). It's apparently faster but less accurate than similar_text. Other posts I've read elsewhere have mentioned that for strings of fewer than 10000 characters, there's no functional difference in speed.

I ended up using a modified version of something I found to make it work. This ends up saving the top 3 results (except in the case of an exact match).

$input = $_POST["searchcity"];
$accuracy = 0;
$runner1acc = 0;
$runner2acc = 0;
while ($cityarr = mysql_fetch_row($allcities)) {
  $cityname = $cityarr[1];
  $cityid = $cityarr[0];
  $city = strtolower($cityname);
  $diff = similar_text($input, $city, $tempacc);

  // check for an exact match
  if ($tempacc == '100') {

    // closest word is this one (exact match)
    $closest = $cityname;
    $closestid = $cityid;
    $accuracy = 100;

    break;
  }

  if ($tempacc >= $accuracy) { // more accurate than current leader
    $runner2 = $runner1;
    $runner2id = $runner1id;
    $runner2acc = $runner1acc;
    $runner1 = $closest;
    $runner1id = $closestid;
    $runner1acc = $accuracy;
    $closest  = $cityname;
    $closestid = $cityid;
    $accuracy = $tempacc;
  }
  if (($tempacc < $accuracy)&&($tempacc >= $runner1acc)) { // new 2nd place
    $runner2 = $runner1;
    $runner2id = $runner1id;
    $runner2acc = $runner1acc;
    $runner1 = $cityname;
    $runner1id = $cityid;
    $runner1acc = $tempacc;
  }
  if (($tempacc < $runner1acc)&&($tempacc >= $runner2acc)) { // new 3rd place
    $runner2 = $cityname;
    $runner2id = $cityid;
    $runner2acc = $tempacc;
  }
}

echo "Input word: $input\n<BR>";
if ($accuracy == 100) {
  echo "Exact match found: $closestid $closest\n";
} elseif ($accuracy > 70) { // for high accuracies, assumes that it's correct
  echo "We think you meant $closestid $closest ($accuracy)\n";
} else {
  echo "Did you mean:<BR>";
  echo "$closestid $closest? ($accuracy)<BR>\n";
  echo "$runner1id $runner1 ($runner1acc)<BR>\n";
  echo "$runner2id $runner2 ($runner2acc)<BR>\n";
}
Daniel
A: 

if(a=1) lolloaldoasodkaslñfm,zn xd,v mad,

A: 

Most maddening result with LIKE is this one "%man" this will return all woman in file! In case of listing perhaps a not too bad solution is to keep on shortening the searching needle. In your case a match will come up when your searching $ is as short as "milwa".

alberto