fuzzy-search

Cross Referencing Databases on Fuzzy Data

I am currently working on a project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contain numerous misspellings and abbreviations, as well as simply invalid data, making it hard to do a cross-reference with the canon...

What's the easiest site search application to implement that supports fuzzy searching?

I have a site that needs to search through about 20-30k records, which are mostly movie and TV show names. The site runs PHP/MySQL with memcache. I'm looking to replace the FULLTEXT with soundex() searching that I currently have, which works... kind of, but isn't very good in many situations. Are there any decent search scripts out there...

PHP/MySQL small-scale fuzzy search

I'm looking to implement fuzzy search for a small PHP/MySQL application. Specifically, I have a database with about 2400 records (records added at a rate of about 600 per year, so it's a small database). The three fields of interest are street address, last name and date. I want to be able to search by one of those fields, and essentiall...

Fuzzy Search on Material Descriptions including numerical sizes & general descriptions of material type

We're looking to provide a fuzzy search on an electrical materials database (i.e. conduit, cable, etc.). The problem is that, because of a lack of consistency across all material types, we could not split sizes into separate fields from the text description because some materials are rated by things other than size. I've attempted a co...

What is a good way to represent simple tabular data accessible by a SQL Server fuzzy search?

Hi all, I'm trying to define a table in a SQL Server database that will hold rules. The rule data will be keyed on a number of columns. Where rules apply to a number of scenarios I want the key columns to contain wildcards to avoid having to maintain lots of data. I then want to find the best match row with some kind of fuzzy search...

Search Lucene with precise edit distances

I would like to search a Lucene index with edit distances. For example, say, there is a document with a field FIRST_NAME; I want all documents with first names that are 1 edit distance away from, say, 'john'. I know that Lucene supports fuzzy searches (FIRST_NAME:john~) and takes a number between 0 and 1 to control the fuzziness. The p...

Approximate/fuzzy string lookup using Tokyo Cabinet

I recently learned about Tokyo Cabinet and more precisely Tokyo Dystopia, a full-text search engine built on top of TC. I'm looking for an approximate/fuzzy text index but it doesn't seem to be supported out-of-the-box by Dystopia. However, it seems like the engine is using a q-gram inverted index so this should be a relatively simple h...

Getting fuzzy string matches from database very fast

Hello. I have a database of ~150'000 words and a pattern (any single word), and I want to get all words from the database whose Damerau-Levenshtein distance from the pattern is less than a given number. I need to do it extremely fast. What algorithm could you suggest? If there's no good algorithm for Damerau-Levenshtein distanc...
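
As a rough illustration of the distance this question refers to, here is a minimal Python sketch of the optimal string alignment variant of Damerau-Levenshtein distance (adjacent transpositions count as one edit), plus a brute-force filter over a word list. The names `osa_distance` and `words_within` are hypothetical, the cutoff is treated as "at most `max_dist`" by assumption, and the linear scan is only a baseline, not the fast index the asker is looking for.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and adjacent transpositions each cost 1."""
    prev2 = None                      # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))    # row i-1 (distance from "" to b[:j])
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def words_within(pattern, words, max_dist):
    """Linear scan; the length check skips words that cannot qualify."""
    return [w for w in words
            if abs(len(w) - len(pattern)) <= max_dist
            and osa_distance(pattern, w) <= max_dist]


if __name__ == "__main__":
    print(words_within("hello", ["helo", "hello", "hlelo", "world", "help"], 1))
```

The length check is a cheap pre-filter: two words whose lengths differ by more than `max_dist` can never be within that distance, so the quadratic computation is skipped for them.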

Similarity function in Postgres with pg_trgm

I'm trying to use the similarity function in Postgres to do some fuzzy text matching; however, whenever I try to use it I get the error: function similarity(character varying, unknown) does not exist. If I add explicit casts to text I get the error: function similarity(text, text) does not exist. My query is: SELECT (similarity("tabl...

Lucene Fuzzy Match on Phrase instead of Single Word

I'm trying to do a fuzzy match on the phrase "Grand Prarie" (deliberately misspelled) using Apache Lucene. Part of my issue is that the ~ operator only does fuzzy matches on single-word terms and behaves as a proximity match for phrases. Is there a way to do a fuzzy match on a phrase with Lucene? ...

"Go to file" feature in various editors

In TextMate there is a feature called "Go to file" that is used for file navigation. It is a box where you type the name of a file in your project and it will use fuzzy matching to generate a list of candidate files from which you can select. Other editors have this feature, but they each give it a different name: Vim fuzzyfinder Emac...

How to quickly find file in the workspace/switch between buffers/etc. in Eclipse?

I am looking for something like TextMate's fuzzy search on Command-T, FuzzyFinder in Vim, or Ido in Emacs. Does it exist? If not, how do you prefer to do it? ...

Algorithms for "fuzzy matching" strings

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit. ...
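
The kind of matching described above can be sketched in Python as a subsequence test with a simple ranking. This is only an assumption about what "best fit" might mean (fewest extra characters inside the matched span, then shortest candidate), not the actual scoring used by TextMate, Ido, or Icicles; `subsequence_score` and `fuzzy_filter` are hypothetical names.

```python
def subsequence_score(query, candidate):
    """Return None if the query's characters do not all appear in order
    in the candidate; otherwise return the number of extra characters
    inside the greedily matched span (lower is better)."""
    q = query.lower()
    c = candidate.lower()
    pos = -1
    first = None
    for ch in q:
        pos = c.find(ch, pos + 1)   # leftmost occurrence after the last match
        if pos == -1:
            return None
        if first is None:
            first = pos
    if first is None:               # empty query matches everything
        return 0
    return (pos - first + 1) - len(q)


def fuzzy_filter(query, candidates):
    """Keep candidates containing the query as a subsequence, best first."""
    scored = []
    for cand in candidates:
        score = subsequence_score(query, cand)
        if score is not None:
            scored.append((score, len(cand), cand))
    scored.sort()
    return [cand for _, _, cand in scored]


if __name__ == "__main__":
    print(fuzzy_filter("abc", ["a_b_c", "abc", "xyz", "aXbc"]))
```

The greedy leftmost match is simple but does not always find the tightest possible span; a fuller implementation would likely search over alternative alignments and reward matches at word or path-segment boundaries.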

Is Lucene fuzzy search lazy?

I would like to use Lucene's fuzzy search, which I understand is based on some sort of Levenshtein-like algorithm. If I use a fairly high threshold (e.g., "new york~0.9"), will it first compute the edit distance and then see if it is less than whatever 0.9 corresponds to, or will it cut off the algorithm if it becomes apparent that the d...

Fuzzy runtime search without using database\index

Hello, I need to filter a stream of text articles by checking every entry for fuzzy matches of a predefined string (I am searching for misspelled product names; sometimes they have a different order of words and extra non-letter characters like ":" or ","). I get excellent results by putting these articles in a Sphinx index and performing search...

SOLR - how to do a fuzzy search on booleans

If my index contains three boolean fields, a, b and c, I would like to search for "a=True, b=False, c=True", and SOLR should return all entries, with a score that represents how well each entry matches the whole query, e.g. (a=T, b=F, c=T, score=1.0), (a=T, b=T, c=T, score=0.6), (a=T, b=T, c=F, score=0.5). Is that possible? ...

Super fuzzy name checking?

I'm working on some stuff for an in-house CRM. The company's current frontend allows for lots of duplicates. I'm trying to stop end-users from putting in the same person because they searched for 'Bill Johnson' and not 'William Johnson.' So the user will put in some information about their new customer and we'll find the similar names (i...

Real world typo statistics?

Where can I find some real-world typo statistics? I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes. There are 2 kinds of mistakes: typos ("Helllo" instead of "Hello", "Satudray" instead of "Saturday", etc.) and spelling errors ("Shikago" instead of "Chicago"). I use Damerau-Levenshte...

Fuzzy search algorithm for western European languages (in my case Swedish)

I'm looking for a fuzzy search implementation that works well with western European languages. Which algorithm works the best and where can I find an implementation in C#? Update: Soundex adapted to Swedish: http://escuelle.blogspot.com/2008/03/swedish-soundex.html NYSIIS implementations: http://www.gamedev.net/community/forums/...

What's TextMate's 'Go to File' fuzzy search algorithm?

TextMate's 'Go to File' fuzzy search is really awesome. Wincent's Command-T plugin for Vim does something similar and it rocks too. Can someone explain how these work? Is there a general term for the method they use? Edit: A little more detail about what those tools do. The tools let you narrow a list of options (in this case file pa...