edit-distance

How do you implement Levenshtein distance in Delphi?

I'm posting this in the spirit of answering your own questions. The question I had was: How can I implement the Levenshtein algorithm for calculating edit-distance between two strings, as described here, in Delphi? Just a note on performance: This thing is very fast. On my desktop (2.33 GHz dual-core, 2 GB RAM, WinXP), I can run throug...
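For readers who want the algorithm itself rather than the Delphi specifics, here is a minimal Python sketch of the standard dynamic-programming Levenshtein distance. It is not the poster's Delphi code; the function name `levenshtein` and the two-row layout are choices made for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory grows with the shorter string rather than with the product of both lengths.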

Levenshtein distance in T-SQL

I am interested in a T-SQL algorithm for calculating Levenshtein distance. ...

Levenshtein distance: how to better handle words swapping positions?

I've had some success comparing strings using the PHP levenshtein function. However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings. For example: levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences are treated as having less in common ...
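One possible way to make swapped words count for less, sketched here in Python rather than PHP, is to sort the whitespace-separated tokens of each string before measuring the distance. This is an assumption about what might help, not something the question itself proposes; `token_sort_distance` is a hypothetical helper name.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain unit-cost Levenshtein distance (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def token_sort_distance(a: str, b: str) -> int:
    """Distance after sorting each string's words, so word order is ignored."""
    normalized_a = " ".join(sorted(a.split()))
    normalized_b = " ".join(sorted(b.split()))
    return levenshtein(normalized_a, normalized_b)


print(token_sort_distance("The quick brown fox", "brown quick The fox"))
```

The trade-off is that word order is discarded entirely, so two sentences with the same words in a different order become indistinguishable.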

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful. The Wikipedia article on edit distance gives some good background on the concept. By taking "chunk transposition" into account, I mean that Turing, Al...

Shortest path to transform one word into another

For a Data Structures project, I must find the shortest path between two words (like "cat" and "dog"), changing only one letter at a time. We are given a Scrabble word list to use in finding our path. For example: cat -> bat -> bet -> bot -> bog -> dog I've solved the problem using a breadth first search, but am seeking something bette...
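As a baseline, the breadth-first search mentioned in the question might look roughly like the Python sketch below. The function name `word_ladder` and the six-word list are made up for illustration; a real run would load the Scrabble word list instead, and only lowercase ASCII letters are assumed.

```python
from collections import deque
from string import ascii_lowercase


def word_ladder(start, goal, words):
    """Return a shortest one-letter-at-a-time path as a list, or None."""
    words = set(words)
    if start == goal:
        return [start]
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for i in range(len(word)):
            for ch in ascii_lowercase:
                candidate = word[:i] + ch + word[i + 1:]
                if candidate in words and candidate not in parent:
                    parent[candidate] = word
                    if candidate == goal:
                        # Walk the parent links back to the start.
                        path = [candidate]
                        while parent[path[-1]] is not None:
                            path.append(parent[path[-1]])
                        return path[::-1]
                    queue.append(candidate)
    return None


# Hypothetical toy dictionary, not the Scrabble list from the question.
toy_words = ["cat", "bat", "bet", "bot", "bog", "dog"]
print(word_ladder("cat", "dog", toy_words))
```

Generating candidates by substituting letters keeps neighbor lookup proportional to word length times alphabet size, instead of scanning the whole dictionary for each word.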

Efficient way of calculating likeness scores of strings when sample size is large?

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list. I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which...

Using base64 encoding as a mechanism to detect changes

Is it possible to use changes in the base64 encoding of an object to detect the degree of change in the object? Suppose I send a document attachment to several users and each makes changes to it and emails back to me, can I use the string distance between original base64 and the received base64s to detect which version has the most...

Working with edit distance and then processing the results to find chunks / groups

Hi, After processing a dictionary of words I have edit distances (or rather similarity in percent) saved in a data structure, kinda like this: s1=String1, s2=String2, similarity=82 s1=String2, s2=String3, similarity=82 s1=aaaaaaa, s2=aaaaaab, similarity=90 s1=aaaaaaa, s2=aaaaaac, similarity=95 My aim is to have a list of groups of simi...

Optimizing Levenshtein distance algorithm

I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the cal...

How to convert a Python/Cython unicode string to an array of long integers, to do Levenshtein edit distance

I have the following Cython code (adapted from the bpbio project) that does Damerau-Levenshtein edit-distance calculation: #--------------------------------------------------------------------------- cdef extern from "stdlib.h": ctypedef unsigned int size_t size_t strlen(char *s) void *malloc(size_t size) void *calloc(size_t n...
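On the pure-Python side, one way to turn a unicode string into a compact array of integer code points is the standard-library `array` module, as in the sketch below. How such an array would then be passed into the Cython routine is not shown in the excerpt, so that part is left out; `to_codepoints` is a hypothetical helper name.

```python
from array import array


def to_codepoints(text: str) -> array:
    """Return the Unicode code points of text as an array of signed longs."""
    # Typecode 'l' is a C signed long; ord() gives one integer per character.
    return array("l", map(ord, text))


codes = to_codepoints("naïve")
print(list(codes))
```

Comparing code points rather than encoded bytes means a non-ASCII character counts as one symbol in the distance calculation.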

How to correct bugs in this Damerau-Levenshtein implementation?

I'm back with another longish question. Having experimented with a number of Python-based Damerau-Levenshtein edit distance implementations, I finally found the one listed below as editdistance_reference(). It seems to deliver correct results and appears to have an efficient implementation. So I sat down to convert the code to Cython. o...
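For comparison when checking such a port, here is a small pure-Python sketch of the restricted Damerau-Levenshtein distance, also called optimal string alignment. It is not the `editdistance_reference()` from the question, which is not reproduced in the excerpt; `osa_distance` is a name chosen here, and the restricted variant can differ from the unrestricted one on some inputs.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))
```

A slow but simple reference like this can be run against the Cython version on random string pairs to locate inputs where the two disagree, provided both implement the same variant.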