I have a mapping of catalog numbers to product names:

35  cozy comforter
35  warm blanket
67  pillow

and need a search that finds misspelled, mixed-up names like "warm cmfrter".

We have code using edit-distance (difflib), but it probably won't scale to the 18,000 names.
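For reference, the difflib approach looks something like this minimal sketch (names taken from the example above; the cutoff value is an assumption):

```python
import difflib

# toy version of the catalog from the question
names = ["cozy comforter", "warm blanket", "pillow"]

# get_close_matches compares the query against every name, so the
# cost grows linearly with the catalog size -- the scaling concern above
matches = difflib.get_close_matches("warm cmfrter", names, n=3, cutoff=0.4)
```

Each comparison is itself quadratic in the string lengths, which is why a full scan over 18,000 names gets slow.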

I achieved something similar with Lucene, but since PyLucene only wraps the Java library, it would complicate deployment to end users.

SQLite doesn't usually have full-text or scoring compiled in.

The Xapian bindings closely follow the C++ API and have some learning curve.

Whoosh is not yet well-documented, but it includes a spell-checker that could be repurposed for this.

What else is there?

+2  A: 

Nucular has full text search, but it doesn't support misspelled matches out of the box. You could try adding an additional field to each entry which indexes the SOUNDEX translation of the terms and then search using the soundex translation of the user input. I really don't know how well this would work...

Take a look at advas http://advas.sourceforge.net/news.php which has a nice demo comparing various soundex-alike methods:

advas/examples Aaron$ python phonetic_algorithms.py 
                    soundex       metaphone           nyiis      caverphone 
====================================================================================================
 schmidt :             S253           sxmtt          sssnad      SKMT111111
  schmid :             S253            sxmt          sssnad      SKMT111111
 schmitt :             S253            sxmt         sssnatt      SKMT111111
   smith :             S530            sm0h           snatt      SMT1111111
  smythe :             S530           smy0h           snatt      SMT1111111
 schmied :             S253            sxmt         sssnaad      SKMT111111
   mayer :             M600             myr           naaar      MA11111111
   meier :             M600              mr           naaar      MA11111111
....

I don't know if any of them is good for your unnamed language...

Aaron Watters
+1  A: 

You will get too many false positives with the SOUNDEX implementation. There are only 26,000 (at most) possible SOUNDEX codes.

Though the Metaphone algorithm was designed for English surnames, it works pretty well for misspellings; I used it once in a branch locator and it was quite successful.

Add a field with the Metaphone translation and match against it if no exact match is found. You will still get false positives, but fewer than with other algorithms.
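A sketch of that exact-then-phonetic fallback. The `consonant_skeleton` stub stands in for a real Metaphone implementation, and both it and the `lookup` helper are invented here for illustration:

```python
def consonant_skeleton(word):
    # crude placeholder for a real Metaphone implementation:
    # keep consonants in order, dropping vowels, spaces, and repeats
    out = []
    for ch in word.lower():
        if ch in "aeiouy ":
            continue
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

catalog = {"cozy comforter": 35, "warm blanket": 35, "pillow": 67}

# phonetic key -> names, built once alongside the exact index
phonetic_index = {}
for name in catalog:
    phonetic_index.setdefault(consonant_skeleton(name), []).append(name)

def lookup(query):
    if query in catalog:            # exact match first
        return [query]
    # fall back to the phonetic field only when the exact match fails
    return phonetic_index.get(consonant_skeleton(query), [])
```

With this stub, a vowel-mangled query like "woorm blaankit" still finds "warm blanket"; a proper Metaphone would catch far more misspellings.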

R Ubben
Those algorithms are optimized for English which, as I didn't mention, is not the language in question.
Tobias
Ah, "warm blanket" fooled me, then. Still, at its heart, the algorithm is just a number of phonetic rules for pronunciation. 16 consonants with three exceptions and not more than 30 transformations or so. Implementing the algorithm in another language might not be too difficult, especially since most languages are better behaved than English. A Google search indicates that people have been doing so.
R Ubben
+1  A: 

The usual way of computing the edit-distance between two strings is a rather expensive algorithm (if I remember correctly, its time complexity is quadratic). Maybe if you used a different string similarity metric then your problem would go away.

One of my favorite methods for fuzzy string matching is [trigram matching]. Comparing two strings using this method has a linear time complexity, which is much better than the mentioned edit-distance. You can find my Python implementation on [GitHub]. There's also a PostgreSQL [contrib module] exactly for that. It shouldn't be too difficult to adapt it to work with SQLite3.
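The idea can be sketched in a few lines (a Jaccard similarity over padded character trigrams; pg_trgm's exact padding and scoring differ slightly):

```python
def trigrams(s):
    s = "  " + s.lower() + " "          # pad the ends so short words still yield trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)  # Jaccard index of the trigram sets
```

Extracting the trigrams of each string is linear in its length, and the set operations are cheap; misspellings like "cmfrter" still share several trigrams with "comforter", while unrelated names share almost none.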

Ryszard Szopa
A: 

Go to www.matchlogics.com.

A: 

Sybase SQL Anywhere has a free Web edition/Developer Edition and comes with full text indexing/search and a FUZZY operator (and some implementation constraints).

To quote from the documentation:

Specifying 'FUZZY "500 main street"' is equivalent to 
'500 OR mai OR ain OR str OR tre OR ree OR eet'.

Another approach would be to use scoring on full-text searches.

Vincent Buck
+1  A: 

Apparently the only way to make fuzzy comparisons fast is to do fewer of them ;)

Instead of writing another n-gram search or improving the one in Whoosh, we now keep a word index, retrieve all entries that share at least one (correctly spelled) word with the query, and use difflib to rank those candidates. That works well enough in this case.
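The scheme might look roughly like this (names taken from the question; the word index confines the expensive difflib comparisons to a small candidate set):

```python
import difflib
from collections import defaultdict

names = ["cozy comforter", "warm blanket", "pillow"]

# word -> set of names containing that word, built once up front
index = defaultdict(set)
for name in names:
    for word in name.split():
        index[word].add(name)

def search(query, limit=5):
    # candidates: every name sharing at least one exact word with the query
    candidates = set()
    for word in query.lower().split():
        candidates |= index.get(word, set())
    # rank only the (small) candidate set with the quadratic difflib ratio
    ranked = sorted(candidates,
                    key=lambda name: difflib.SequenceMatcher(None, query, name).ratio(),
                    reverse=True)
    return ranked[:limit]
```

For "warm cmfrter", only names containing the correctly spelled "warm" are ever compared, so the full-catalog scan from the question disappears.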

Tobias