fuzzy-comparison

Good Python modules for fuzzy string comparison?

I'm looking for a Python module that can do simple fuzzy string comparisons. Specifically, I'd like a percentage of how similar the strings are. I know this is potentially subjective so I was hoping to find a library that can do positional comparisons as well as longest similar string matches, among other things. Basically, I'm hoping...
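
As a rough starting point, the standard library's difflib can produce both a similarity percentage and the longest shared substring. This is only a sketch: the helper names `similarity_percent` and `longest_common_block` are made up for illustration, and the score reflects `SequenceMatcher`'s own notion of similarity, which may not match what the asker has in mind.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 score derived from SequenceMatcher.ratio()."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def longest_common_block(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]
```

Identical strings score 100 and strings with no characters in common score 0; everything in between depends on how much of the two strings can be aligned.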

How do I determine whether a number is within a percentage of another number

I'm writing iPhone code that fuzzily recognizes whether a swiped line is straight-ish. I get the bearing of the two end points and compare it to 0, 90, 180 and 270 degrees with a tolerance of 10 degrees plus or minus. Right now I do it with a bunch of if blocks, which seems super clunky. How to write a function that, given the bearing 0...
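
The chain of if blocks could probably be replaced by a loop over the target angles plus a wrap-around-aware difference. The sketch below is in Python rather than iPhone code, and the names `angular_difference` and `nearest_axis` are hypothetical; only the four axes and the 10 degree tolerance come from the question.

```python
def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def nearest_axis(bearing: float, tolerance: float = 10.0,
                 axes=(0.0, 90.0, 180.0, 270.0)):
    """Return the first axis within `tolerance` degrees of `bearing`, else None."""
    for axis in axes:
        if angular_difference(bearing, axis) <= tolerance:
            return axis
    return None
```

Taking the difference modulo 360 means a bearing of 355 is treated as 5 degrees away from 0, which is the case the plain comparisons tend to miss.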

Is it possible to compare two tables when no common key exists between them?

I have two tables that I would like to compare for duplicates. These tables are just basic company information fields like name, city, state, etc. The only possibly common field that I can see would be the name column but the names are not quite exact. Is there a way that I can run a comparison between the two using a LIKE statement? I'm...

Fuzzy Text Search: Regex Wildcard Search Generator?

I'm wondering if there is some kind of way to do fuzzy string matching in PHP. Looking for a word in a long string, finding a potential match even if its mis-spelled; something that would find it if it was off by one character due to an OCR error. I was thinking a regex generator might be able to do it. So given an input of "crazy" it w...

TSQL Query for analyzing Text.

I have a table that has ordernumber, cancelled date and reason. The reason field is a varchar(255) field that was filled in by many different sales reps, so it is really hard to group by reason category. I need to generate a report that categorizes cancellation reasons. What is the best way to analyse the reasons with TSQL? Sample of reasons entered ...

q-gram approximate matching optimisations

Hi I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking to this, but search performance is not great on this data volume (around 5 minutes). I basically have two questions: (1) Can you suggest any ways to improve p...

How can I recognize slightly modified images?

I have a very large database of jpeg images, about 2 million. I would like to do a fuzzy search for duplicates among those images. Duplicate images are two images that have many (around half) of their pixels with identical values and the rest are off by about +/- 3 in their R/G/B values. The images are identical to the naked eye. It'...

What would cause a Fuzzy Lookup to return a Null set of values from the reference table?

I'm doing a fuzzy lookup on a view of a table which does a fine job returning similarities with the occasional exception, and I can't seem to figure out what is causing the problem. Every so often, the comparison will come up with null values from the lookup view, even though the values exist in both the view and the original table and t...

Fuzzy Regular Expressions

In my work I have used approximate string matching algorithms such as Damerau–Levenshtein distance, with great results, to make my code less vulnerable to spelling mistakes. Now I need to match strings against simple regular expressions such as TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan s...

Comparing (similar) images with Python/PIL

I'm trying to calculate the similarity (read: Levenshtein distance) of two images, using Python 2.6 and PIL. I plan to use the python-levenshtein library for fast comparison. Main question: What is a good strategy for comparing images? My idea is something like: Convert to RGB (transparent -> white) (or maybe convert to monochrome?...

Using MinHash to find similarities between 2 images

I am using the MinHash algorithm to find similar images. I came across this post, How can I recognize slightly modified images?, which pointed me to the MinHash algorithm. I was using a C# implementation from this blog post, Set Similarity and Min Hash. But while trying to use the implementation, I have run into 2 problems. ...

Anything wrong with this function for comparing floats?

When my Floating-Point Guide was published on Slashdot yesterday, I got a lot of flak for my suggested comparison function, which was indeed inadequate. So I finally did the sensible thing and wrote a test suite to see whether I could get them all to pass. Here is my result so far. And I wonder if this is really as good as one can get wi...

Fuzzy match two hash tables?

Hi, I'm looking for ideas on how to best match two hash tables containing string key/value pairs. Here's the actual problem I'm facing: I have structured data coming in which is imported into the database. I need to UPDATE records which are already in the DB, however, it's possible that ANY value in the source can change, therefore I d...

Generate "fuzzy" difference of two files in Python.

Hello all, I have a problem comparing two files. Basically, what I want to do is a UNIX-like diff between two files, for example: $ diff -u left-file right-file However my two files contain floats; and because these files were generated on distinct architectures (but computing the same things), the floating values are not exactly th...
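
One way this might be approached is to compare the files line by line and token by token, treating numeric tokens as equal when they agree within a tolerance. The sketch below assumes whitespace-separated tokens and files whose lines correspond one-to-one (it does not realign inserted or deleted lines the way diff does); the function names and default tolerances are illustrative choices, not anything given in the question.

```python
import math
from itertools import zip_longest


def tokens_match(left: str, right: str,
                 rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Equal text matches exactly; numeric tokens match within tolerance."""
    if left == right:
        return True
    try:
        return math.isclose(float(left), float(right),
                            rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        # At least one token is not a number and the text differs.
        return False


def lines_match(left: str, right: str, **tol) -> bool:
    """Compare two lines token by token (tokens are split on whitespace)."""
    lt, rt = left.split(), right.split()
    return len(lt) == len(rt) and all(
        tokens_match(a, b, **tol) for a, b in zip(lt, rt))


def fuzzy_diff(left_lines, right_lines, **tol):
    """Yield (line_number, left, right) for lines that differ beyond tolerance."""
    pairs = zip_longest(left_lines, right_lines, fillvalue="")
    for number, (left, right) in enumerate(pairs, start=1):
        if not lines_match(left, right, **tol):
            yield number, left, right
```

Passing two open file objects (or two lists of lines) to `fuzzy_diff` yields only the lines whose non-numeric text differs or whose numbers disagree by more than the tolerance.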

Canonical URL compare in Python?

Are there any tools to do a URL compare in Python? For example, if I have http://google.com and google.com/ I'd like to know that they are likely to be the same site. If I were to construct a rule manually, I might uppercase it, then strip off the http:// portion, and drop anything after the last alpha-numeric character. But I can se...
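
A manual rule along those lines could be sketched with urllib.parse from the standard library. The normalisation choices here (defaulting to http when no scheme is given, dropping a leading "www.", ignoring scheme, port and trailing slashes) are assumptions made for illustration, and `canonical_key` and `probably_same_site` are hypothetical names; a real comparison would likely need more rules.

```python
from urllib.parse import urlsplit


def canonical_key(url: str) -> tuple:
    """Reduce a URL to a (host, path, query) tuple for loose comparison."""
    url = url.strip()
    if "://" not in url:
        # Assumption: a bare "google.com/" is treated as an http URL.
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lower-cased
    if host.startswith("www."):
        host = host[4:]  # assumption: "www." does not change the site
    path = parts.path.rstrip("/")
    return host, path, parts.query


def probably_same_site(a: str, b: str) -> bool:
    """True when both URLs reduce to the same canonical key."""
    return canonical_key(a) == canonical_key(b)
```

With these rules, http://google.com and google.com/ reduce to the same key, so `probably_same_site` reports them as likely the same site.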

How to group / compare similar news articles

In an app that I'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and MSNBC would be in the same group. I am guessing it's some sort of fuzzy logic comparison. How would I go a...

How can I use jaro-winkler to find the closest value in a table?

I have an implementation of the jaro-winkler algorithm in my database. I did not write this function. The function compares two values and gives the probability of match. So jaro(string1, string2, matchnoofchars) will return a result. Instead of comparing two strings, I want to send one string with a matchnoofchars and then get a resu...

Sorting items in JQuery on a variety of attributes ...

I need to give users the functionality to sort a list of products based on several pre-selected field names. The products list is structured roughly like ... <span class="productList"> <div class="product"> <p><strong class="sortName">Title</strong></p> <p>Weight: <span class="sortWeight">3.50</span> pounds</p> <p>Price: <span ...

Using pen strokes with fuzzy tolerance algorithm as encryption key

How can I encrypt/decrypt with fuzzy tolerance? I want to be able to use a Stroke on an InkCanvas as key for my encryption but when decrypting again the user should not have to draw the exact same symbol, only similar. Can this be done in .NET C#? --- Update (9 sep) --- What I ideally want is an encryption algorithm that would accept ...

Fuzzy matching API in a long list of queries

I have an application which lets people ask predefined queries. However, the list of such queries is too long. Hence, the current approach is to let users enter a word in the search box and then show them the likely matches from the list of queries. (Very much like Google's "Did you mean" feature.) Is there an API in Java available for...