ansaurus

Question

Answer 1

+2 A:

Have a look at the Levenshtein distance. This is a way of measuring how different two strings are from one another.

Hopefully I understood your question correctly; using "distance" in the same sentence as "latitude and longitude" could be confusing!

Greg Hewgill 2009-05-25 20:45:11

My fault.. using "distance" IS confusing. As far as lat and long are concerned I really meant the phisical distance. As far as the strings are concerned I meant the "differences" between the two strings.The Levenshtein distance seems intresting, it would be perfect if there was a "ready to use" library for distance measuring...

PieroP 2009-05-25 20:52:27

PHP has a Levenshtein distance function built in: http://www.php.net/manual/en/function.levenshtein.php

Pourquoi Litytestdata 2009-05-25 21:00:49

Thanks for the input

PieroP 2009-05-25 21:39:27

Answer 2

+2 A:

Although written in c (with python and tcl bindings), libdistance would be a tool for applying several distances metrics on strings/data.

Metrics included:

bloom
damerau
euclid
hamming
jaccard
levenshtein
manhattan
minkowski
needleman_wunsch

The MYYN 2009-05-25 20:59:10

Answer 3

A:

In the meanwhile I found SumMetrics in Java, has anybody ever used it?

PieroP 2009-05-25 21:01:04

I checked out their Levenshtein implementation, and I dare say that the one I provided in my post uses less memory (although that is less of an issue with short strings).

Cecil Has a Name 2009-05-25 22:24:42

Answer 4

A:

I took the liberty to translate a piece of C# code I've written to calculate the Levenshtein distance into Java code. It uses only two single-dimension arrays that alternate instead of a big jagged array:

public static int getDifference(String a, String b)
{
    // Minimize the amount of storage needed:
    if (a.length() > b.length())
    {
     // Swap:
     String x = a;
     a = b;
     b = x;
    }

    // Store only two rows of the matrix, instead of a big one
    int[] mat1 = new int[a.length() + 1];
    int[] mat2 = new int[a.length() + 1];

    int i;
    int j;

    for (i = 1; i <= a.length(); i++)
     mat1[i] = i;

    mat2[0] = 1;

    for (j = 1; j <= b.length(); j++)
    {
     for (i = 1; i <= a.length(); i++)
     {
      int c = (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);

      mat2[i] =
       Math.min(mat1[i - 1] + c,
       Math.min(mat1[i] + 1, mat2[i - 1] + 1));
     }

     // Swap:
     int[] x = mat1;
     mat1 = mat2;
     mat2 = x;

     mat2[0] = mat1[0] + 1;
    }

    // It's row #1 because we swap rows at the end of each outer loop,
    // as we are to return the last number on the lowest row
    return mat1[a.length()];
}

It is not rigorously tested, but it seems to be working okay. It was based on a Python implementation I made for a university exercise. Hope this helps!

Cecil Has a Name 2009-05-25 21:50:00

Answer 5

A:

You might get some decent results using a phonetic algorithm to find slightly misspelld names.

Also, if you use a more mechanical edit distance, you'll probably see better results using a weighted function that accounts for keyboard geometry (i.e. physically close keys are "cheaper" to replace than far off ones). That's a patented method btw, so be careful not to write something that becomes too popular ;)

Christoffer 2009-05-25 22:22:42

How can such a simple (but brilliant) idea be patented? :P Or was it the exact technique to honor the keyboard mapping?

Cecil Has a Name 2009-05-25 22:26:10

Because software algorithms can be patented in some legally backwards jurisdictions :) I'm just an engineer so I've never bothered to look up the details there, just trusting the company legal advisors.

Christoffer 2009-05-25 22:38:14

The idea of the phonetic algorithm is very nice. Is there any library to implement this feature?

PieroP 2009-05-26 07:00:48

The SimMetrics library you found appears to have some phonetics for .NET on http://sourceforge.net/project/showfiles.php?group_id=123463

Cecil Has a Name 2009-05-26 08:55:55

Answer 6

A:

I would recommend either Levenshtein Distance or the Jaccard Distance for comparing text.

Thomas Bratt 2009-05-26 13:34:42

ansaurus

tags:

views:

answers:

String distance library

related questions