Can you suggest a lightweight fuzzy text search library?
What I want to do is let users find the correct data even when their search terms contain typos.
I could use a full-text search engine like Lucene, but I think it's overkill.
Edit:
To make the question clearer, here is the main scenario for that library:
I have a large list of strings. ...
As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex.
...
Hey, I'm using the Levenshtein algorithm to get the distance between a source and a target string.
I also have a method which returns a value from 0 to 1:
/// <summary>
/// Gets the similarity between two strings.
/// All relation scores are in the [0, 1] range,
/// which means that if the score gets a maximum value (equal to 1)
/// then the two st...
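The method quoted above is cut off, so its exact normalization is unknown. As a minimal sketch of one common way to turn a Levenshtein distance into a [0, 1] score, the Python below divides the distance by the length of the longer string; that choice, and the names `levenshtein` and `similarity`, are assumptions for illustration and not the original code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # drop a character of b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance into [0, 1]; 1.0 means the strings are identical.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this normalization, strings that share no characters in the same positions and have equal length score 0, and a "60% similarity" threshold corresponds to `similarity(a, b) >= 0.6`.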
I have a database of strings (arbitrary length) which holds more than one million items (potentially more).
I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under ...
I'm looking for a high-performance Java library for fuzzy string search.
There are numerous algorithms for finding similar strings: Levenshtein distance, Daitch-Mokotoff Soundex, n-grams, etc.
What Java implementations exist? What are their pros and cons? I'm aware of Lucene; is there any other solution, or is Lucene the best?
I found these; does anyone have experien...
How would you solve this problem?
You're scraping HTML of blogs. Some of the HTML of a blog is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell what text in the HTML belongs to which post (i.e. a permalink) if any.
I know what you're thinking: You could just look at the RSS and ignore the HTML altogether...
My users will import through cut and paste a large string that will contain company names.
I have an existing and growing MySQL database of company names, each with a unique company_id.
I want to be able to parse through the string and assign a fuzzy match to each of the user-entered company names.
Right now, just doing a straight-...
What is the best fuzzy matching algorithm (fuzzy logic, n-gram, Levenshtein, Soundex, ...) to process more than 100000 records in the least time?
...
Hello,
I have a pretty simple SSIS package with 3 components:
OLE DB Source
Fuzzy Lookup
OLE DB Destination
In the Fuzzy Lookup component, on the Advanced tab, I changed "Maximum number of matches to output per lookup" from 1 to 2.
When I run the package after the change I get this error message:
[OLE DB Destination [57]] Error:...
I'm building a search function for a PHP website using Zend Lucene and I'm having a problem.
My website is a shop directory (or something like that).
For example, I have a shop named "FooBar", but my visitors search for "Foo Bar" and get zero results. Likewise, if a shop is named "Foo Bar" and a visitor searches for "FooBar", nothing is found.
I tried t...
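One way to make "FooBar" and "Foo Bar" find each other is to index and query a normalized key with the whitespace removed. The sketch below shows the idea in plain Python, independent of Zend Lucene; `normalize_key`, `build_index` and `search` are hypothetical names, and dropping every non-alphanumeric ASCII character is an assumption that may be too aggressive for names with accented letters.

```python
import re


def normalize_key(name: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]+", "", name.lower())


def build_index(shop_names):
    """Map each normalized key to the original shop names that share it."""
    index = {}
    for name in shop_names:
        index.setdefault(normalize_key(name), []).append(name)
    return index


def search(index, query):
    """Look the query up by its normalized key; returns a list of shop names."""
    return index.get(normalize_key(query), [])


index = build_index(["FooBar", "Foo Bar", "Baz"])
print(search(index, "foo bar"))  # both spellings of the shop name
```

In a full-text engine the same effect is usually obtained by storing the normalized key in an extra field and applying the same normalization to the query before searching that field.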
I am new to the field of approximate string matching.
I am exploring uses for the Bitap algorithm, but so far its limited pattern length has me troubled. I am working with Flash, and I have at my disposal 32-bit unsigned integers and an IEEE-754 double-precision floating-point Number type, which can devote up to 53 bits to integers. Still, I wo...
I have a mapping of catalog numbers to product names:
35 cozy comforter
35 warm blanket
67 pillow
and need a search that would find misspelled, mixed names like "warm cmfrter".
We have code using edit-distance (difflib), but it probably won't scale to the 18000 names.
I achieved something similar with Lucene, but as PyLucene only...
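A minimal sketch of one way to handle misspelled, mixed names with difflib: match each query word against the vocabulary of distinct words rather than against whole names, then score catalog numbers by how many query words they cover. The function names and the 0.6 cutoff are assumptions, and whether this is fast enough for 18000 names would need to be measured.

```python
import difflib
from collections import Counter, defaultdict

# Sample data taken from the mapping shown above.
CATALOG = [(35, "cozy comforter"), (35, "warm blanket"), (67, "pillow")]


def build_word_index(catalog):
    """Map every distinct word to the catalog numbers whose names contain it."""
    word_to_numbers = defaultdict(set)
    for number, name in catalog:
        for word in name.lower().split():
            word_to_numbers[word].add(number)
    return word_to_numbers


def search(query, word_to_numbers, cutoff=0.6):
    """Score catalog numbers by the count of query words they fuzzily match."""
    vocabulary = list(word_to_numbers)
    scores = Counter()
    for token in query.lower().split():
        # Closest vocabulary word for this token, if any is similar enough.
        close = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if close:
            for number in word_to_numbers[close[0]]:
                scores[number] += 1
    return scores.most_common()


index = build_word_index(CATALOG)
print(search("warm cmfrter", index))  # → [(35, 2)]
```

Because "warm" and "comforter" belong to different names under the same catalog number, scoring by number rather than by name is what lets the mixed query "warm cmfrter" resolve to 35.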
I have a table Persons with personal data and so on. There are lots of columns, but the ones of interest here are addressindex, lastname and firstname, where addressindex is a unique address drilled down to the door of the apartment.
So if I have, as below, two persons with the same lastname and one of the firstnames is the same, they are most l...
I have a Postgres table with about 5 million records and I want to find the closest match to an input key. I tried using trigrams with the pg_trgm module, but it took roughly 5 seconds per query, which is too slow for my needs.
Is there a faster way to do fuzzy match within Postgres?
...
I am developing a SharePoint portal for a suggestions and rewards system and need to flag duplicate suggestions. Suggestions will be in free-text format, hence the need for fuzzy search. I understand that the “Damerau-Levenshtein algorithm” does fuzzy search, but how do I implement it in a SharePoint portal? Can Microsoft Search Server help? If yes, how...
I have a number of cuboids whose positions and sizes are given with minimum and maximum x, y and z co-ordinates (so they are parallel to the main axes).
e.g. I might have the following 3 cuboids:
10.5 <= x <= 39.4, 90.73 <= y <= 110.2, 90.23 <= z <= 95.87
20.1 <= x <= 30.05, 9.4 <= y <= 37.6, 0.1 <= z <= 91.2
10.2 <= x <= 10.3, ...
I'm wondering if there is some way to do fuzzy string matching in PHP. I want to look for a word in a long string and find a potential match even if it's misspelled: something that would find it if it was off by one character due to an OCR error.
I was thinking a regex generator might be able to do it. So given an input of "crazy" it w...
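The regex-generator idea can be sketched as follows, shown in Python rather than PHP. It only covers the case where exactly one character may be replaced by another (a typical OCR substitution), not insertions or deletions; that restriction and the name `one_substitution_pattern` are assumptions, since the rest of the question is cut off.

```python
import re


def one_substitution_pattern(word: str) -> re.Pattern:
    """Build a regex matching `word` with at most one character replaced."""
    if not word:
        raise ValueError("word must not be empty")
    alternatives = []
    for i in range(len(word)):
        # Replace position i with a wildcard; escape the literal parts.
        alternatives.append(re.escape(word[:i]) + "." + re.escape(word[i + 1:]))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")


pattern = one_substitution_pattern("crazy")
print(pattern.pattern)  # \b(?:.razy|c.azy|cr.zy|cra.y|craz.)\b
print(pattern.findall("a crazv idea and a crazy one, not lazy"))
```

The exact word still matches because the wildcard also accepts the original character. The pattern grows linearly with the word length for one substitution, but it grows much faster if more errors are allowed, which is where an edit-distance function tends to be the simpler tool.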
I want to find candidate duplicate records in a large database, matching on fields like COMPANYNAME and ADDRESSLINE1.
Example:
For a record with the following COMPANYNAME:
"Acme, Inc."
I would like for my query to spit out other records with these COMPANYNAME values as possible dups:
"Acme Corporation"
"Acme, Incorporated...
Here's my problem: a user searches for products by size. The result should show all products of the desired size (if any) plus products progressively larger and smaller until there are at least 50 undersized and 50 oversized products displayed in addition to the correctly-sized products.
The result should always show all products of a ...
Recently I've looked through several implementations of the bitap algorithm, but all of them only find the starting point of a fuzzy match. What I need is to find the match itself. Here's an example:
Say we have following text: abcdefg
and a pattern: bzde
and we want to find all occurrences of the pattern in the text with at most 1 error (Edit d...
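For the example above, the sketch below reports both ends of each approximate match. It is not bitap: it uses the plain dynamic-programming formulation of approximate matching (often attributed to Sellers), extended so that each cell also carries the start offset of its alignment. The name `approximate_matches` and the tie-breaking rule (prefer the earliest start) are assumptions for illustration.

```python
def approximate_matches(text: str, pattern: str, max_errors: int):
    """Return (start, end, distance) triples where text[start:end] matches
    the pattern with at most max_errors edits (insert, delete, substitute)."""
    m = len(pattern)
    # prev[i] = (distance, start) for pattern[:i] against the text seen so far.
    prev = [(i, 0) for i in range(m + 1)]
    results = []
    for j, ch in enumerate(text, start=1):
        # An empty pattern prefix matches anywhere at zero cost, starting here.
        cur = [(0, j)]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            substitute = (prev[i - 1][0] + cost, prev[i - 1][1])
            extra_text_char = (prev[i][0] + 1, prev[i][1])
            missing_text_char = (cur[i - 1][0] + 1, cur[i - 1][1])
            # Tuples compare by distance first, then by earliest start.
            cur.append(min(substitute, extra_text_char, missing_text_char))
        if cur[m][0] <= max_errors:
            results.append((cur[m][1], j, cur[m][0]))
        prev = cur
    return results


for start, end, distance in approximate_matches("abcdefg", "bzde", 1):
    print(start, end, "abcdefg"[start:end], distance)  # 1 5 bcde 1
```

This runs in time proportional to the text length times the pattern length, so it is slower than bitap on long texts. A common compromise is to use bitap to locate candidate end positions and run a small DP like this one backwards only around those positions.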