views:

684

answers:

8

I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?

Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

+1  A: 

I think you cannot avoid looping through the array as the levenstein() function takes only strings and not an array as input.

You can do something like:

for($i=0;$i<count($array)-1;$i++)
{
    for($j=$i+1;$j<count($array);$j++)
    {
     $lev = levenshtein($array[$i],$array[$j]);
     if($lev == 0)
     {
      // exact match
     }
     else if($lev <= THRESHOLD)
     {
      // similar
     }
    }
}
codaddict
I am afraid of this as it would do 25 million comparisons.
phirschybar
@phirschybar: Actually it will be 12.5 Million, we are comparing only distinct pairs, if pair(X,Y) is compared, we skip comparing pair(Y,X).
codaddict
@bzabhi: touche! :)
phirschybar
+2  A: 

Due to the nature of the Levenshtein algorithm (specifically the fact that it's a comparision between two strings), I can't see how this is possible.

You could of course reduce the number of comparisons by performing some basic matching requirements first, but this is out of the scope of what you're asking.

As a (quite possibly irrelevant) option, you could always use something like soundex which would let you pre-compute the string values. (You can also use it directly in MySQL I believe.)

middaparka
+2  A: 

You could group them based on soundexes then limit the comparisons to the nearest N cases...

 $mashed=array();
 foreach ($address as $key=>$val) {
      $mashed[$key]=soundex($val);
 }
 sort($mashed);

Then iterate through the keys of $mashed.

C.

symcbean
I have never worked with soundexes. Any idea how well they work with street address type abbreviations like "St." ?
phirschybar
Soundex ( http://en.wikipedia.org/wiki/Soundex ) is designed to work with people names. If you apply them to address, it makes sense to apply soundex algorithm to each part of address e.g "23 Bird Road" -> SOUNDEX("Bird") and SOUNDEX("Road")
Juha Syrjälä
This solution would be something like `O(2n)` or `O(2nm)` (both unsimplified), where `m` is each word in the address. This is a great improvement over `O(n^2)`.
Justin Johnson
It can't interpolate abbreviations - but St could be Street or Saint or...? depending on context.<?phpprint soundex('ST');print "\n\n";print soundex('street');print "\n\n";print soundex('stuff');?>C.
symcbean
+1  A: 

Given you problem, I see no other way than to compare every address with every other address if you want to use Lehvenstein distance.

First of all, you should normalize addressess, get rid of abbreviations etc.

  • Ave -> Avenue
  • Rd. -> Road

You could have some fixed max Lehvenstein distance ( N ) for similar addresses.

If so, you could you could abort the Lehvenstein algorithm when you are sure that the edit distance for current address pair is larger than N. For this you need to write a custom version of Lehvenstein algorithm. This will make the algorithm a little quicker.

There are also some related trivial optimizations. For example: if address A is 10 chars long and address B is 20 chars long and you consider addresses that have Lehvenstein distance of less than 8 to be similar. You can look lengths of addresses and immediately decide that they are not similar.

Juha Syrjälä
A: 

If you want to find all similar values, you will have to compare all items to all others. But choosing the right array functions will speed things up significantly. Here is a quick example (the results array could have been better):

$results = array();
$count = count($entries);
while ($count != 0) {
    # The entry to process
    $entry = array_shift($entries);
    # Get levenshtein distances to all others
    $result = array_map(
        'levenshtein',
        # array_map() needs two arrays, this one is an array consisting of
        # multiple entries of the value that we are processing
        array_fill($entry, 0, $count),
        $toCompare
    );
    $results[] = array($entry => $result);
    $count--;
}
soulmerge
+2  A: 

You can use a bk-tree to speed-up the search/comparison.

http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:

Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space.
[...]
Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.
[...]
Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree - a substantial improvement over checking every node!

edit: But that doesn't help you with your ("12 Bird Road, Apt 6" and "12 Bird Rd. #6") problem. Only with your brute-force m*n comparison.

VolkerK
+3  A: 

...but on reflection I think a better way to group similar addresses would be to:

  1. create a database with two tables - one for the address (and a id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)

  2. uppercase the address, replace anything other than [A-Z] or [0-9] with a space

  3. split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with

  4. for each address (with id $target) find the most similar addresses:

SELECT similar.id, similar.address, count(*) FROM adress similar, word cmp, word src WHERE src.address_id=$target AND src.soundex=cmp.soundex AND cmp.address_id=similar.id ORDER BY count(*) LIMIT $some_value;

  1. calculate the levenstein difference between your source address and the top few values returned by the query.

(doing any sort of operation on large arrays is often faster in databases)

C.

symcbean
I would use normalized addresses in above algoritm. That is, addresses where abbreviations have been removed etc. ( "Ave." changed to "Avenue" etc)
Juha Syrjälä
A: 

You say...

The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

I say... use the techniques at http://semaphorecorp.com/mpdd/mpdd.html

joe snyder