levenshtein-distance

Levenshtein distance in T-SQL

I am interested in an algorithm in T-SQL for calculating Levenshtein distance. ...

Levenshtein distance: how to better handle words swapping positions?

I've had some success comparing strings using the PHP levenshtein function. However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings. For example: levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences are treated as having less in common ...
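One common workaround for swapped words, sketched here in Python rather than PHP, is to sort the whitespace-separated tokens of each string before measuring the distance. The helper name `token_sort_distance` is hypothetical, and this is only one possible approach under the assumption that word order carries no meaning:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def token_sort_distance(a: str, b: str) -> int:
    """Hypothetical helper: compare strings after sorting their words."""
    return levenshtein(" ".join(sorted(a.split())),
                       " ".join(sorted(b.split())))


# Reordered words no longer count as differences.
print(token_sort_distance("The quick brown fox", "brown quick The fox"))
```

Sorting discards word order entirely, so it only suits data where the order is irrelevant.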

Question on Levenshtein distance

1) Why do we add 1 on these lines? d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion The line if s[i] = t[j] then cost := 0 else cost := 1 should take into account deleted/lower word lengths, or am I missing something? 2) Also, the comments state deletion and insertion. Am I right in thinking that it's chec...
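The recurrence quoted in this question can be sketched in Python as follows. The function name is illustrative, and the code is a minimal rendering of the standard dynamic-programming table, not the asker's original source. The `+ 1` appears because each deletion or insertion is itself one edit, independent of whether the current characters match:

```python
def levenshtein(s: str, t: str) -> int:
    m, n = len(s), len(t)
    # d[i][j] holds the distance between the prefixes s[:i] and t[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # turn s[:i] into "" with i deletions
    for j in range(n + 1):
        d[0][j] = j  # turn "" into t[:j] with j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion: one extra edit
                d[i][j - 1] + 1,         # insertion: one extra edit
                d[i - 1][j - 1] + cost,  # substitution, free on a match
            )
    return d[m][n]


print(levenshtein("kitten", "sitting"))  # 3
```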

Edit Distance Algorithm

I have a dictionary of 'n' words given and there are 'm' Queries to respond to. I want to output the number of words in dictionary which are edit distance 1 or 2. I want to optimize the result set given that n and m are roughly 3000. Edit added from answer below: I will try to word it differently. Initially there are 'n' words given ...

Compare 5000 strings with PHP Levenshtein

I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999? Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to fi...

Levenshtein distance on non-English strings

Will the Levenshtein distance algorithm work well for non-English language strings too? Update: Would this work automatically in a language like Java when comparing Asian characters? ...

What algorithm gives suggestions in a spell checker?

What algorithm is typically used when implementing a spell checker that is accompanied by word suggestions? At first I thought it might make sense to check each new word typed (if not found in the dictionary) against its Levenshtein distance from every other word in the dictionary and returning the top results. However, this seems l...

Best way to detect similar email addresses?

I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as [email protected], [email protected], [email protected], etc. I want to find similar email addresses for evaluation. Currently I'm using a Levenshtein algorithm to check each e-mail against the ot...

How can I optimize retrieving lowest edit distance from a large table in SQL?

Hey, I'm having trouble optimizing this Levenshtein Distance calculation I'm doing. I need to do the following: Get the record with the minimum distance for the source string as well as a trimmed version of the source string Pick the record with the minimum distance If the min distances are equal (original vs trimmed), choose the trim...

Optimizing Levenshtein distance algorithm

I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the cal...

Writing a post search algorithm.

I'm trying to write a free text search algorithm for finding specific posts on a wall (similar kind of wall as Facebook uses). A user is supposed to be able to write some words in a search field and get hits on posts that contain the words; with the best match on top and then other posts in decreasing order according to match score. I'm ...

Generate a set of strings with maximum edit distance

Problem 1: I'd like to generate a set of n strings of fixed length m from alphabet s such that the minimum Levenshtein distance (edit distance) between any two strings is greater than some constant c. Obviously, I can use randomization methods (e.g., a genetic algorithm), but was hoping that this may be a well-studied problem in compute...

Damerau-Levenshtein php

I'm searching for an implementation of the Damerau–Levenshtein algorithm for PHP, but it seems that I can't find anything with my friend Google. So far I have to use PHP implemented Levenshtein (without Damerau transposition, which is very important), or get an original source code (in C, C++, C#, Perl) and write (translate) it to PHP. ...
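For reference, the transposition step can be sketched in Python (not PHP) as below. Note the assumption: this is the restricted "optimal string alignment" variant, which never edits a substring more than once, so it can differ from the unrestricted Damerau–Levenshtein distance. The function name `osa_distance` is illustrative:

```python
def osa_distance(s: str, t: str) -> int:
    m, n = len(s), len(t)
    # d[i][j] holds the distance between the prefixes s[:i] and t[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: the last two characters are swapped.
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


print(osa_distance("ca", "ac"))   # 1: a single transposition
print(osa_distance("ca", "abc"))  # 3 in this restricted variant
```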

Fast Levenshtein distance in R?

Is there a package that contains a Levenshtein distance function implemented in C or Fortran? I have many strings to compare and stringMatch from MiscPsycho is too slow for this. ...

Most efficient way to calculate Levenshtein distance

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming majority of time is spent calculating the distance between the query and the possible results. I am currently implementing the algorithm to calculate the Levenshtein Dista...

Finding closest neighbour using optimized Levenshtein Algorithm

I recently posted a question about optimizing the algorithm to compute the Levenshtein Distance, and the replies led me to the Wikipedia article on Levenshtein Distance. The article mentioned that if there is a bound k on the maximum distance a possible result can be from the given query, then the running time can be reduced from O(mn)...
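The bounded idea mentioned here can be sketched in Python as follows: only cells within k of the main diagonal are evaluated, and the search stops as soon as a whole row exceeds k. The function name and the choice to return `None` above the bound are assumptions made for illustration. For simplicity this sketch still allocates full rows, so only the inner loop is limited to the band:

```python
from typing import Optional


def bounded_levenshtein(s: str, t: str, k: int) -> Optional[int]:
    """Return the distance if it is at most k, otherwise None."""
    m, n = len(s), len(t)
    if abs(m - n) > k:
        return None  # the length difference alone already exceeds k
    inf = k + 1      # stands in for "more than k"
    prev = [j if j <= k else inf for j in range(n + 1)]
    for i in range(1, m + 1):
        lo, hi = max(1, i - k), min(n, i + k)  # band |i - j| <= k
        cur = [inf] * (n + 1)
        if i <= k:
            cur[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution or match
                inf,                 # cap so values never grow past k + 1
            )
        if min(cur[lo - 1:hi + 1]) >= inf:
            return None  # every cell in the band already exceeds k
        prev = cur
    return prev[n] if prev[n] <= k else None


print(bounded_levenshtein("kitten", "sitting", 3))  # 3
print(bounded_levenshtein("kitten", "sitting", 2))  # None
```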

How can I create a threshold for similar strings using Levenshtein distance and account for typos?

We recently encountered an interesting problem at work where we discovered duplicate user submitted data in our database. We realized that the Levenshtein distance between most of this data was simply the difference between the 2 strings in question. That indicates that if we simply add characters from one string into the other then we e...

Levenshtein Distance on only part of a string (Java)

I have an online web application with a top menu tree for opening different widgets for performing different tasks. As the app grows more powerful, that tree has become large and difficult to navigate. I've implemented a search feature, where users can just type the menu name or part of it and I use regex to find all items in the menu ...

How to correct bugs in this Damerau-Levenshtein implementation?

I'm back with another longish question. Having experimented with a number of Python-based Damerau-Levenshtein edit distance implementations, I finally found the one listed below as editdistance_reference(). It seems to deliver correct results and appears to have an efficient implementation. So I set down to convert the code to Cython. o...

Is Levenshtein slow in MySQL?

Yesterday I had a question where people suggested I use Levenshtein method. Is it a slow query? Maybe I can use something else? ...