Hello, I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'

addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'

I'm planning to apply some string transformations: abbreviate long words (like NORTH -> N) and remove all spaces, commas, dashes, and pound signs, so the addresses above become:

addr_1 = '3FAIRMONTLINKS'
addr_2 = '3FAIRMONTLINKS'

addr_3 = '570348THAV'
addr_4 = '570348AV'

Now, having this output, how can I compare addr_3 with the rest of the addresses and detect similar ones? What percentage of similarity would be safe? Could you provide some simple Python code for this?

Thanks,

Eduardo

+2  A: 

Removing spaces, commas, and dashes will be ambiguous. It would be better to replace them with a single space.

Take, for example, this address:

56 5th avenue

and this one:

5, 65th avenue

With your method, both of them become:

565THAV

What you can do is write a good address-shortening algorithm and then use string comparison to detect duplicates. This should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of a single digit can mean a huge difference between addresses.

The algorithm can go like this:

  1. Replace all commas and dashes with spaces. Use the translate method for that.
  2. Build a dictionary of words and their abbreviated forms.
  3. Remove the TH part when it follows a number (see the sketch after this list).
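
A minimal sketch of these steps, assuming an illustrative abbreviation map (you would extend it with the full USPS list):

import re

# Illustrative abbreviation map -- extend with the full USPS list.
ABBREVIATIONS = {'AVENUE': 'AVE', 'STREET': 'ST', 'NORTH': 'N', 'SOUTH': 'S'}

def normalize(addr):
    # 1. Replace commas, dashes and pound signs with spaces.
    addr = addr.upper().translate(str.maketrans(',-#', '   '))
    # 2. Abbreviate known long words.
    addr = ' '.join(ABBREVIATIONS.get(word, word) for word in addr.split())
    # 3. Drop an ordinal suffix (TH, ST, ND, RD) attached to a number.
    return re.sub(r'(\d+)(?:TH|ST|ND|RD)\b', r'\1', addr)

normalize('5703 - 48TH AVE') == normalize('5703- 48 AVENUE')  # both become '5703 48 AVE'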
Nadia Alramli
Thanks for pointing out the flaw in eliminating spaces altogether; I hadn't foreseen that side effect.
Eduardo
+2  A: 

This should be helpful in building your dictionary of abbreviations:

http://www.usps.com/ncsc/lookups/usps%5Fabbreviations.html

delparnel
Thanks for the link; it will definitely help in building my list or dictionary.
Eduardo
+1  A: 

First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):

adr = " ".join(adr.tolower().split())

Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":

adr = re.sub("1st(\b|$)", r'1', adr)
adr = re.sub("([2-9])\s?nd(\b|$)", r'\1', adr)

Note that the second sub() will work with a space between the "2" and the "nd", but I didn't make the first one do that, because I'm not sure how you could tell the difference between "41 St Ave" and "41 St" (the second one is "41 Street" abbreviated).

Be sure to read all the help for the re module; it's powerful but cryptic.

Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:

http://en.wikipedia.org/wiki/Soundex

http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html

adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
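
If you would rather not depend on the linked module, here is a minimal sketch of American Soundex; treat it as illustrative rather than a validated implementation:

def soundex(word):
    # Minimal American Soundex sketch: first letter plus up to three digits.
    codes = {'BFPV': '1', 'CGJKQSXZ': '2', 'DT': '3', 'L': '4', 'MN': '5', 'R': '6'}
    def code(c):
        return next((d for letters, d in codes.items() if c in letters), '')
    word = word.upper()
    result = word[0]
    prev = code(word[0])
    for c in word[1:]:
        d = code(c)
        if d and d != prev:
            result += d
        if c not in 'HW':  # H and W do not separate duplicate codes
            prev = d
    return (result + '000')[:4]  # e.g. soundex('ASHCRAFT') == 'A261'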

Then you can work with the list or join it back to a string as you think best.

The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.

Good luck.

steveha
Thanks, I will take a look at the links about Soundex. I'll also take a look at the Levenshtein Python extension and C library for fuzzy matching: http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
Eduardo
Soundex is a very English-specific algorithm. If you deal with street names derived from other languages, you may find it produces many false positives.
John Fouhy
Soundex is t3h p0x ... it produces many false positives and false negatives with English names. If you are brave enough to rely on phonetic similarity, use a more modern method like (Double) Metaphone. If you expect most errors to be caused by reading or typing problems, use Levenshtein distance or better. For best results, use a combination. If you have any sizable volume, use a C extension -- naive Levenshtein implementations are O(N**2).
John Machin
A: 

I had to do this once. I converted everything to lowercase, computed each address's Levenshtein distance to every other address, and ordered the results. It worked very well, but it was quite time-consuming.

You'll want to use an implementation of Levenshtein in C rather than in Python if you have a large data set. My data set was a few tens of thousands of addresses, and it took the better part of a day to run, I think.
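
A sketch of that approach in pure Python (raw_addresses here is a placeholder for your input list; for a large data set, swap in a C implementation such as the python-Levenshtein package):

from itertools import combinations

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

addresses = [a.lower() for a in raw_addresses]  # raw_addresses: your input list (placeholder)
pairs = sorted((levenshtein(a, b), a, b) for a, b in combinations(addresses, 2))
# The smallest distances at the front of the list are the likeliest duplicates.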

Steven Huwig
A: 

I regularly inspect addresses for duplication where I work, and I have to say, I find Soundex highly unsuitable. It's both too slow and too eager to match things. I have similar issues with Levenshtein distance.

What has worked best for me is to sanitize and tokenize the addresses (get rid of punctuation, split things up into words) and then just see how many tokens match up. Because addresses typically have several tokens, you can develop a level of confidence in terms of a combination of (1) how many tokens were matched, (2) how many numeric tokens were matched, and (3) how many tokens are available. For example, if all tokens in the shorter address are in the longer address, the confidence of a match is pretty high. Likewise, if you match 5 tokens including at least one that's numeric, even if the addresses each have 8, that's still a high-confidence match.

It's definitely useful to do some tweaking, like substituting some common abbreviations. The USPS lists help, though I wouldn't go gung-ho trying to implement all of them, and some of the most valuable substitutions aren't on those lists. For example, 'JFK' should be a match for 'JOHN F KENNEDY', and there are a number of common ways to shorten 'MARTIN LUTHER KING JR'.

Maybe it goes without saying but I'll say it anyway, for completeness: Don't forget to just do a straight string comparison on the whole address before messing with more complicated things! This should be a very cheap test, and thus is probably a no-brainer first pass.

Obviously, the more time you're willing and able to spend (both on programming/testing and on run time), the better you'll be able to do. Fuzzy string matching techniques (faster and less generalized kinds than Levenshtein) can be useful, as a separate pass from the token approach (I wouldn't try to fuzzy match individual tokens against each other). I find that fuzzy string matching doesn't give me enough bang for my buck on addresses (though I will use it on names).
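
A rough sketch of the token approach (the scoring and the numeric-token discount here are illustrative guesses, not values from my production code):

import re

def tokens(addr):
    # Sanitize and tokenize: uppercase, then split into runs of letters/digits.
    return set(re.findall(r'[A-Z0-9]+', addr.upper()))

def match_confidence(addr_a, addr_b):
    a, b = tokens(addr_a), tokens(addr_b)
    if not a or not b:
        return 0.0
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    matched = shorter & longer
    # Fraction of the shorter address's tokens found in the longer one,
    # discounted when no numeric token matched.
    score = len(matched) / len(shorter)
    if not any(t.isdigit() for t in matched):
        score *= 0.5
    return score

match_confidence('# 3 FAIRMONT LINK SOUTH', '3 FAIRMONT LINK S')  # 0.75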

John Y
+1  A: 

In order to do this right, you need to standardize your addresses according to USPS standards (your address examples appear to be US-based). There are many direct marketing service providers that offer CASS (Coding Accuracy Support System) certification of postal addresses. The CASS process will standardize all of your addresses and append ZIP+4 codes to them. Any undeliverable addresses will be flagged, which will further reduce your postal mailing costs, if that is your intent. Once all of your addresses are standardized, eliminating duplicates will be trivial.

galaxywatcher
+1 for the technique. Such address standardisation and unique-ID assignment services are available in many countries. However, I'm not sure what you mean by "eliminating" duplicates. The unique ID identifies a postal delivery point. These may be common to many organisations (thousands of small companies at an accounting firm's address), or many people (a jail, hospital, military base, or mail agent in a rural community), or a few people (most residences). Eliminating duplicate addresses appears relevant only in a one-off, one-per-household mailout where you don't care about the bulk receivers.
John Machin
The OP is looking for "a method of finding duplicate addresses". The primary reason, in my experience, for people to find duplicate addresses is so that they have the option of sending only one mailing to the address. Presumably, if there are duplicate addresses, there will be duplicate names, in which case they will most certainly want to eliminate duplicate records (duplicate names AND addresses).
galaxywatcher