views:

563

answers:

6

Hi,

I need to measure the distance between two strings (names of places). Since some times the names are written slightl differently I was looking for a library that could help me measure the difference and then combine it with a measure of the latitude and longitude to select the correct matches. Preferred languages: Java or Php

Any suggestion?

Thanks, Piero

+2  A: 

Have a look at the Levenshtein distance. This is a way of measuring how different two strings are from one another.

Hopefully I understood your question correctly; using "distance" in the same sentence as "latitude and longitude" could be confusing!

Greg Hewgill
My fault.. using "distance" IS confusing. As far as lat and long are concerned I really meant the phisical distance. As far as the strings are concerned I meant the "differences" between the two strings.The Levenshtein distance seems intresting, it would be perfect if there was a "ready to use" library for distance measuring...
PieroP
PHP has a Levenshtein distance function built in: http://www.php.net/manual/en/function.levenshtein.php
Pourquoi Litytestdata
Thanks for the input
PieroP
+2  A: 

Although written in c (with python and tcl bindings), libdistance would be a tool for applying several distances metrics on strings/data.

Metrics included:

  • bloom
  • damerau
  • euclid
  • hamming
  • jaccard
  • levenshtein
  • manhattan
  • minkowski
  • needleman_wunsch
The MYYN
A: 

In the meanwhile I found SumMetrics in Java, has anybody ever used it?

PieroP
I checked out their Levenshtein implementation, and I dare say that the one I provided in my post uses less memory (although that is less of an issue with short strings).
Cecil Has a Name
A: 

I took the liberty to translate a piece of C# code I've written to calculate the Levenshtein distance into Java code. It uses only two single-dimension arrays that alternate instead of a big jagged array:

public static int getDifference(String a, String b)
{
    // Minimize the amount of storage needed:
    if (a.length() > b.length())
    {
     // Swap:
     String x = a;
     a = b;
     b = x;
    }

    // Store only two rows of the matrix, instead of a big one
    int[] mat1 = new int[a.length() + 1];
    int[] mat2 = new int[a.length() + 1];

    int i;
    int j;

    for (i = 1; i <= a.length(); i++)
     mat1[i] = i;

    mat2[0] = 1;

    for (j = 1; j <= b.length(); j++)
    {
     for (i = 1; i <= a.length(); i++)
     {
      int c = (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);

      mat2[i] =
       Math.min(mat1[i - 1] + c,
       Math.min(mat1[i] + 1, mat2[i - 1] + 1));
     }

     // Swap:
     int[] x = mat1;
     mat1 = mat2;
     mat2 = x;

     mat2[0] = mat1[0] + 1;
    }

    // It's row #1 because we swap rows at the end of each outer loop,
    // as we are to return the last number on the lowest row
    return mat1[a.length()];
}

It is not rigorously tested, but it seems to be working okay. It was based on a Python implementation I made for a university exercise. Hope this helps!

Cecil Has a Name
A: 

You might get some decent results using a phonetic algorithm to find slightly misspelld names.

Also, if you use a more mechanical edit distance, you'll probably see better results using a weighted function that accounts for keyboard geometry (i.e. physically close keys are "cheaper" to replace than far off ones). That's a patented method btw, so be careful not to write something that becomes too popular ;)

Christoffer
How can such a simple (but brilliant) idea be patented? :P Or was it the exact technique to honor the keyboard mapping?
Cecil Has a Name
Because software algorithms can be patented in some legally backwards jurisdictions :) I'm just an engineer so I've never bothered to look up the details there, just trusting the company legal advisors.
Christoffer
The idea of the phonetic algorithm is very nice. Is there any library to implement this feature?
PieroP
The SimMetrics library you found appears to have some phonetics for .NET on http://sourceforge.net/project/showfiles.php?group_id=123463
Cecil Has a Name
A: 

I would recommend either Levenshtein Distance or the Jaccard Distance for comparing text.

Thomas Bratt