We are cleaning up and analyzing a large volume of human-entered customer data. We need to decide programmatically whether two addresses (for example) are the same, even though the data was entered with slight variations.

Right now we run each address through fairly simplistic string replacement (replacing "avenue" with "ave", for example), concatenate the fields, and compare the results. We do something similar with names.
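For reference, the replace-concatenate-compare approach might look roughly like the sketch below. The `ABBREVIATIONS` table and the `normalize_address` name are made up for illustration; a real replacement list would be much longer.

```python
import re

# Hypothetical replacement table, for illustration only.
ABBREVIATIONS = {
    "avenue": "ave",
    "street": "st",
    "road": "rd",
    "boulevard": "blvd",
    "floor": "fl",
}

def normalize_address(*fields):
    """Join the address fields, lowercase, strip punctuation, abbreviate."""
    text = " ".join(fields).lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Swap each whole token for its abbreviation when one is listed.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

a = normalize_address("234 5th Avenue,", "2nd Floor", "New York NY 10002")
b = normalize_address("234 5th Ave", "2nd Fl.", "New York, NY 10002")
print(a == b)
```

Replacing whole tokens rather than substrings avoids accidents such as turning "streetcar" into "stcar".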

At the very least, it seems like our list of search-replace values should already exist somewhere.

Or perhaps you can suggest a totally different and superior way to detect matches?

+1  A: 

Soundex and its variants might be a good start, as are the other approaches suggested on its Wikipedia page.
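To give an idea of what that looks like for names, here is a minimal sketch of classic American Soundex. The function name is my own, and it only handles plain English letters, so treat it as a starting point rather than a finished implementation.

```python
def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]  # pad or truncate to letter + 3 digits

print(soundex("Robert"), soundex("Rupert"))  # both R163
print(soundex("Smith") == soundex("Smyth"))  # True
```

Names that sound alike map to the same code, so you would compare the codes rather than the raw strings.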

msw
+3  A: 

For the addresses, you should run them through Google's Maps API and get a geocode for each one. If the geocodes are the same, the place is the same. I believe they allow 10k hits/day/IP for free.

It's unlikely that you'd come up with anything better on your own.

http://code.google.com/apis/maps/
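Once you have a latitude/longitude pair for each address, the comparison itself could be sketched as below. This does not call any geocoding API. The `same_place` helper and the 25 m tolerance are assumptions of mine, since two lookups of the same place may not return bit-identical coordinates.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))

def same_place(geo_a, geo_b, tolerance_m=25.0):
    """Treat two (lat, lon) pairs as the same place if within tolerance_m.

    The tolerance is an assumed value; tune it for your data.
    """
    return haversine_m(*geo_a, *geo_b) <= tolerance_m

print(same_place((40.0, -74.0), (40.0, -74.0)))   # True
print(same_place((40.0, -74.0), (40.01, -74.0)))  # False, roughly 1.1 km apart
```

Note that a geocode generally won't distinguish floors or suites at the same street address, so you may still need to compare those fields separately.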

fastmultiplication
Thanks, I think this will be really useful!
anyaelena
A: 

Essentially, you're trying to measure how similar two strings are, and there are many different ways to do that. The Dice coefficient could work fairly well for what you're doing, although it is a somewhat costly operation.

http://en.wikipedia.org/wiki/Dice_coefficient
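As a rough sketch, a Dice coefficient over character bigrams could be computed like this. The function names are mine, and this version counts repeated bigrams (a multiset variant); some descriptions use plain sets instead.

```python
from collections import Counter

def bigrams(text):
    """Count the adjacent character pairs in a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def dice_coefficient(a, b):
    """Return 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are too short to have bigrams; fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total

print(dice_coefficient("night", "nacht"))  # 0.25
```

The result is between 0 (no shared bigrams) and 1 (identical bigram counts), so you would pick a threshold above which two strings count as a match.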

If you want a more comprehensive list of string similarity measures try here: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Jacob Schlather
Perhaps I'm missing something, but aren't "234 5th avenue, 2nd floor, new york NY 10002" and "234 7th avenue, 2nd floor, new york NY 10002" very similar strings but distinct addresses?
anyaelena