I have a mapping of catalog numbers to product names:

35  cozy comforter
35  warm blanket
67  pillow

and need a search that finds misspelled, mixed-up names like "warm cmfrter".

We have code using edit-distance (difflib), but it probably won't scale to the 18,000 names.
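For reference, the difflib approach looks something like this minimal sketch (names taken from the example above; the cutoff value is an assumption):

```python
import difflib

# toy version of the catalog from the question
names = ["cozy comforter", "warm blanket", "pillow"]

# get_close_matches compares the query against every name, so the
# cost grows linearly with the catalog size -- the scaling concern above
matches = difflib.get_close_matches("warm cmfrter", names, n=3, cutoff=0.4)
```

Each comparison is itself quadratic in the string lengths, which is why a full scan over 18,000 names gets slow.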

I achieved something similar with Lucene, but since PyLucene only wraps the Java library, it would complicate deployment to end users.

SQLite doesn't usually have full-text or scoring compiled in.

The Xapian bindings closely follow the C++ API and have some learning curve.

Whoosh is not yet well-documented, but it includes a spell-checker that could be repurposed for this.

What else is there?

+2  A: 

Nucular has full text search, but it doesn't support misspelled matches out of the box. You could try adding an additional field to each entry which indexes the SOUNDEX translation of the terms and then search using the soundex translation of the user input. I really don't know how well this would work...

Take a look at advas http://advas.sourceforge.net/news.php which has a nice demo comparing various soundex-alike methods:

advas/examples Aaron$ python phonetic_algorithms.py 
                    soundex       metaphone           nyiis      caverphone 
====================================================================================================
 schmidt :             S253           sxmtt          sssnad      SKMT111111
  schmid :             S253            sxmt          sssnad      SKMT111111
 schmitt :             S253            sxmt         sssnatt      SKMT111111
   smith :             S530            sm0h           snatt      SMT1111111
  smythe :             S530           smy0h           snatt      SMT1111111
 schmied :             S253            sxmt         sssnaad      SKMT111111
   mayer :             M600             myr           naaar      MA11111111
   meier :             M600              mr           naaar      MA11111111
....

I don't know if any of them is good for your unnamed language...

Aaron Watters
+1  A: 

You will get too many false positives with the SOUNDEX implementation. There are only 26,000 (at most) possible SOUNDEX codes.

Though the Metaphone algorithm was designed for English surnames, it works pretty well for misspellings; I used it once in a branch locator and it was quite successful.

Add a field with the Metaphone translation and match against it if no exact match is found. You will still get false positives, but fewer than with other algorithms.
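A sketch of that exact-then-phonetic fallback. The `consonant_skeleton` stub stands in for a real Metaphone implementation, and both it and the `lookup` helper are invented here for illustration:

```python
def consonant_skeleton(word):
    # crude placeholder for a real Metaphone implementation:
    # keep consonants in order, dropping vowels, spaces, and repeats
    out = []
    for ch in word.lower():
        if ch in "aeiouy ":
            continue
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

catalog = {"cozy comforter": 35, "warm blanket": 35, "pillow": 67}

# phonetic key -> names, built once alongside the exact index
phonetic_index = {}
for name in catalog:
    phonetic_index.setdefault(consonant_skeleton(name), []).append(name)

def lookup(query):
    if query in catalog:            # exact match first
        return [query]
    # fall back to the phonetic field only when the exact match fails
    return phonetic_index.get(consonant_skeleton(query), [])
```

With this stub, a vowel-mangled query like "woorm blaankit" still finds "warm blanket"; a proper Metaphone would catch far more misspellings.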

R Ubben
Those algorithms are optimized for English which, as I didn't mention, is not the language in question.
Tobias
Ah, "warm blanket" fooled me, then. Still, at its heart, the algorithm is just a number of phonetic rules for pronunciation. 16 consonants with three exceptions and not more than 30 transformations or so. Implementing the algorithm in another language might not be too difficult, especially since most languages are better behaved than English. A Google search indicates that people have been doing so.
R Ubben
+1  A: 

The usual way of computing the edit-distance between two strings is a rather expensive algorithm (if I remember correctly, its time complexity is quadratic). Maybe if you used a different string similarity metric then your problem would go away.

One of my favorite methods for fuzzy string matching is [trigram matching]. Comparing two strings using this method has a linear time complexity, which is much better than the mentioned edit-distance. You can find my Python implementation on [GitHub]. There's also a PostgreSQL [contrib module] exactly for that. It shouldn't be too difficult to adapt it to work with SQLite3.
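The idea can be sketched in a few lines (a Jaccard similarity over padded character trigrams; pg_trgm's exact padding and scoring differ slightly):

```python
def trigrams(s):
    s = "  " + s.lower() + " "          # pad the ends so short words still yield trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)  # Jaccard index of the trigram sets
```

Extracting the trigrams of each string is linear in its length, and the set operations are cheap; misspellings like "cmfrter" still share several trigrams with "comforter", while unrelated names share almost none.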

Ryszard Szopa
A: 

Go to www.matchlogics.com.

A: 

Sybase SQL Anywhere has a free Web edition/Developer Edition and comes with full text indexing/search and a FUZZY operator (and some implementation constraints).

To quote from the documentation:

Specifying 'FUZZY "500 main street"' is equivalent to 
'500 OR mai OR ain OR str OR tre OR ree OR eet'.

Another approach would be to use scoring on full-text searches.

Vincent Buck
+1  A: 

Apparently the only way to make fuzzy comparisons fast is to do fewer of them ;)

Instead of writing another n-gram search or improving the one in Whoosh, we now keep a word index, retrieve all entries that share at least one (correctly spelled) word with the query, and use difflib to rank those candidates. That works well enough in this case.
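The scheme might look roughly like this (names taken from the question; the word index confines the expensive difflib comparisons to a small candidate set):

```python
import difflib
from collections import defaultdict

names = ["cozy comforter", "warm blanket", "pillow"]

# word -> set of names containing that word, built once up front
index = defaultdict(set)
for name in names:
    for word in name.split():
        index[word].add(name)

def search(query, limit=5):
    # candidates: every name sharing at least one exact word with the query
    candidates = set()
    for word in query.lower().split():
        candidates |= index.get(word, set())
    # rank only the (small) candidate set with the quadratic difflib ratio
    ranked = sorted(candidates,
                    key=lambda name: difflib.SequenceMatcher(None, query, name).ratio(),
                    reverse=True)
    return ranked[:limit]
```

For "warm cmfrter", only names containing the correctly spelled "warm" are ever compared, so the full-catalog scan from the question disappears.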

Tobias