fuzzy-search

Lightweight fuzzy search library

Can you suggest a lightweight fuzzy text search library? What I want to do is allow users to find the correct data for search terms with typos. I could use a full-text search engine like Lucene, but I think it's overkill. Edit: To make the question clearer, here is the main scenario for that library: I have a large list of strings. ...

Levenshtein distance based methods vs. Soundex

As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex. ...

Fuzzy text (sentences/titles) matching in C#

Hey, I'm using the Levenshtein algorithm to get the distance between a source and a target string. I also have a method which returns a value from 0 to 1: /// <summary> /// Gets the similarity between two strings. /// All relation scores are in the [0, 1] range, /// which means that if the score gets a maximum value (equal to 1) /// then the two st...
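
The excerpt above is cut off before the method body, so the following is only a minimal Python sketch of one common way to turn a Levenshtein distance into a score in the [0, 1] range. The normalization (dividing by the longer string's length) and the function names `levenshtein` and `similarity` are assumptions for illustration, not the asker's actual code.

```python
def levenshtein(source: str, target: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Keep the shorter string as the row to use less memory.
    if len(source) < len(target):
        source, target = target, source
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def similarity(source: str, target: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(source, target) / longest
```

With this normalization, a score of 1 means the two strings are identical, which agrees with the part of the doc comment that survives in the excerpt.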

How to find best fuzzy match for a string in a large string database

I have a database of strings (arbitrary length) which holds more than one million items (potentially more). I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under ...

fuzzy string search in Java

I'm looking for a high-performance Java library for fuzzy string search. There are numerous algorithms for finding similar strings: Levenshtein distance, Daitch-Mokotoff Soundex, n-grams, etc. What Java implementations exist? What are their pros and cons? I'm aware of Lucene; is there any other solution, or is Lucene the best? I found these, anyone has experien...

Map RSS entries to HTML body w. non-exact search

How would you solve this problem? You're scraping HTML of blogs. Some of the HTML of a blog is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell what text in the HTML belongs to which post (i.e. a permalink) if any. I know what you're thinking: You could just look at the RSS and ignore the HTML altogether...

How do I do a fuzzy match of company names in MySQL with PHP for auto-complete?

My users will import through cut and paste a large string that will contain company names. I have an existing and growing MySQL database of company names, each with a unique company_id. I want to be able to parse through the string and assign to each of the user-inputted company names a fuzzy match. Right now, just doing a straight-...

Best Fuzzy Matching Algorithm?

What is the best fuzzy matching algorithm (Fuzzy Logic, N-Gram, Levenshtein, Soundex, ...) to process more than 100000 records in less time? ...

SSIS fuzzy lookup with multiple outputs per lookup error

Hello, I have a pretty simple SSIS package with 3 components: OLE DB Source, Fuzzy Lookup, and OLE DB Destination. In the Fuzzy Lookup component, on the Advanced tab, I changed the "Maximum number of matches to output per lookup" from 1 to 2. When I run the package after the change I get this error message: [OLE DB Destination [57]] Error:...

How to find "FooBar" when searching "Foo Bar" in Zend Lucene

I'm building a search function for a PHP website using Zend Lucene and I'm having a problem. My website is a shop directory (something like that). For example, I have a shop named "FooBar", but my visitors search for "Foo Bar" and get zero results. Also, if a shop is named "Foo Bar" and a visitor searches for "FooBar", nothing is found. I tried t...

Overcoming the Bitap algorithm's search pattern length

I am new to the field of approximate string matching. I am exploring uses for the Bitap algorithm, but so far its limited pattern length has me troubled. I am working with Flash, and I have at my disposal 32-bit unsigned integers and an IEEE-754 double-precision floating-point Number type, which can devote up to 53 bits to integers. Still, I wo...

How to do fuzzy string search without a heavy database?

I have a mapping of catalog numbers to product names: 35 cozy comforter; 35 warm blanket; 67 pillow. I need a search that would find misspelled, mixed names like "warm cmfrter". We have code using edit-distance (difflib), but it probably won't scale to the 18000 names. I achieved something similar with Lucene, but as PyLucene only...
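
As an illustration of the scenario in this excerpt, here is a small Python sketch that matches each query word against the vocabulary of product-name words with the standard library's `difflib.get_close_matches`, then ranks catalog numbers by how many query words they cover. The word-level index, the `CATALOG` sample, and the function names are assumptions made for this example; it is not the asker's code, and it makes no claim about how well it scales to 18000 names.

```python
from collections import defaultdict
from difflib import get_close_matches

# Sample data taken from the question excerpt.
CATALOG = [(35, "cozy comforter"), (35, "warm blanket"), (67, "pillow")]


def build_word_index(catalog):
    """Map each lowercase word to the set of catalog numbers whose names contain it."""
    index = defaultdict(set)
    for number, name in catalog:
        for word in name.lower().split():
            index[word].add(number)
    return index


def search(query, index, cutoff=0.6):
    """Return catalog numbers ordered by how many query words fuzzily match them."""
    vocabulary = list(index)
    scores = defaultdict(int)
    for word in query.lower().split():
        matched = set()
        # get_close_matches ranks candidates by difflib's similarity ratio.
        for candidate in get_close_matches(word, vocabulary, n=3, cutoff=cutoff):
            matched |= index[candidate]
        for number in matched:
            scores[number] += 1
    # Most matched words first; ties broken by catalog number.
    return sorted(scores, key=lambda number: (-scores[number], number))


index = build_word_index(CATALOG)
print(search("warm cmfrter", index))
```

Matching per word rather than per full name is what lets a "mixed" query such as "warm cmfrter" resolve to catalog number 35 even though its two words come from different product names.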

Fuzzy matching using T-SQL

I have a table Persons with personal data and so on. There are lots of columns but the ones of interest here are: addressindex, lastname and firstname where addressindex is a unique address drilled down to the door of the apartment. So if I have 'like below' two persons with the lastname and one the firstnames are the same they are most l...

Is there a Postgres fuzzy match faster than pg_trgm?

I have a Postgres table with about 5 million records and I want to find the closest match to an input key. I tried using trigrams with the pg_trgm module, but it took roughly 5 seconds per query, which is too slow for my needs. Is there a faster way to do fuzzy matching within Postgres? ...

Implement fuzzy search in a SharePoint portal

I am developing a SharePoint portal for a suggestions and rewards system and need to flag duplicate suggestions. Suggestions will be in free-text format, hence the need for fuzzy search. I understand that the “Damerau-Levenshtein algorithm” does fuzzy search, but how do I implement it in a SharePoint portal? Can Microsoft Search Server help? If yes, how...

How do I determine which cuboids a point is in without iterating over them all?

I have a number of cuboids whose positions and sizes are given with minimum and maximum x, y and z co-ordinates (so they are parallel to the main axes). E.g. I might have the following 3 cuboids: (1) 10.5 <= x <= 39.4, 90.73 <= y <= 110.2, 90.23 <= z <= 95.87; (2) 20.1 <= x <= 30.05, 9.4 <= y <= 37.6, 0.1 <= z <= 91.2; (3) 10.2 <= x <= 10.3, ...

Fuzzy Text Search: Regex Wildcard Search Generator?

I'm wondering if there is some way to do fuzzy string matching in PHP: looking for a word in a long string and finding a potential match even if it's misspelled, something that would find it if it was off by one character due to an OCR error. I was thinking a regex generator might be able to do it. So given an input of "crazy" it w...

A good SQL strategy for fuzzy matching possible duplicates using SQL Server 2005

I want to find possible candidate duplicate records in a large database, matching on fields like COMPANYNAME and ADDRESSLINE1. Example: for a record with the COMPANYNAME "Acme, Inc.", I would like my query to spit out other records with these COMPANYNAME values as possible dups: "Acme Corporation", "Acme, Incorporated...

I need a "fuzzy" query to get products above and below a given dimension

Here's my problem: a user searches for products by size. The result should show all products of the desired size (if any) plus products progressively larger and smaller until there are at least 50 undersized and 50 oversized products displayed in addition to the correctly-sized products. The result should always show all products of a ...

Finding a fuzzy match with bitap algorithm

Recently I've looked through several implementations of the bitap algorithm, but all of them only find the beginning point of a fuzzy match. What I need is to find a match. Here's an example: say we have the following text: abcdefg and a pattern: bzde and we want to find all occurrences of the pattern in the text with at most 1 error (Edit d...
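
To make the example in this excerpt concrete, here is a hedged Python sketch that reports the full span of each approximate match, not only one endpoint. Note that it is not bitap: it uses the plain dynamic-programming formulation (often attributed to Sellers) and carries a start index alongside each distance. It also assumes that "error" means edit distance (insertions, deletions, substitutions), since the excerpt is truncated at that point; the function name `approximate_matches` is made up for this illustration.

```python
def approximate_matches(text, pattern, max_errors):
    """Return (start, end, errors) triples such that text[start:end] matches
    pattern within max_errors edits, one triple per qualifying end position."""
    m = len(pattern)
    # Column for the empty text prefix: aligning pattern[:i] with nothing costs i.
    dist = list(range(m + 1))
    start = [0] * (m + 1)
    matches = []
    for j, char in enumerate(text, start=1):
        new_dist = [0] * (m + 1)    # an empty pattern prefix matches anywhere for free
        new_start = [j] * (m + 1)   # ...and that empty match starts at position j
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == char else 1
            # Each candidate is (errors, start); min prefers fewer errors,
            # then the earlier start.
            new_dist[i], new_start[i] = min(
                (dist[i - 1] + cost, start[i - 1]),       # match or substitution
                (dist[i] + 1, start[i]),                  # extra character in text
                (new_dist[i - 1] + 1, new_start[i - 1]),  # pattern character missing
            )
        dist, start = new_dist, new_start
        if dist[m] <= max_errors:
            matches.append((start[m], j, dist[m]))
    return matches


for begin, end, errors in approximate_matches("abcdefg", "bzde", 1):
    print(begin, end, errors, "abcdefg"[begin:end])
```

This runs in O(len(text) * len(pattern)) time, so it trades bitap's bit-parallel speed for having no limit on pattern length and for simple recovery of the match boundaries.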