views:

425

answers:

2

I'm looking to implement fuzzy search for a small PHP/MySQL application. Specifically, I have a database with about 2400 records (records added at a rate of about 600 per year, so it's a small database). The three fields of interest are street address, last name and date. I want to be able to search by one of those fields, and essentially have tolerance for spelling/character errors. i.e., an address of "123 Main Street" should also match "123 Main St", "123 Main St.", "123 Mian St", "123 Man St", "132 Main St", etc. and likewise for name and date.

The main issues I have with answers to other similar questions:

  • It's impossible to define synonyms for every possible incorrect spelling, forget doing so for dates and names.
  • Lucene, etc. seems very heavy-weight for such a limited search data set (call it a maximum of 5,000 records, 3 fields per record).
  • Just doing something with wildcards doesn't seem logical with all of the possible spelling errors.

Any suggestions? I know it isn't going to be possible to do natively with MySQL, but since the data set is so limited, I'd like to keep it relatively simple... perhaps a PHP class that gets all of the records from the DB, uses some sort of comparison algorithm, and returns the IDs of the similar records?

Thanks, Jason

+1  A: 

If it is a very small database, you could load all the data at once and use an algorithm like Jaro-Winkler for your search. They have an implementation in PHP, which you can find here.

Imho it works really well. Take a look at an example implementation here. I know that that search uses the same algorithm, and it can find 'Nintedno' very well. It also sorts the results for you, based on which result best matches your query.

Razzie
+1  A: 

Razzie's answer (or using Damerau–Levenshtein) ranks a list of candidates matches according to their closeness to the search key. (Take care: if the key is "12 Main St" then "13 Main St" has the same typing distance as "12 Moin St" but you might want to rank it low or even exclude it, as with 11 and 22 Main St etc.)

But how do you select a list of candidates of a manageable size to rank?

One way is to compute the metaphone value (or values, using double-metaphone) for each word in the strings your going to search. Save each of these metaphones in another table with the id of the row containing the original string. You can then search these metaphone values quickly with LIKE 'key%' where key is the metaphone of a word from the search text.

Check out the suggested answer on this thread. It's quite neat and should work nicely for DBs that aren't huge.

fsb