views:

1167

answers:

8

I have a list of addresses in two separate tables that are slightly off that I need to be able to match. For example, the same address can be entered in multiple ways:

  • 110 Test St
  • 110 Test St.
  • 110 Test Street

Although simple, you can imagine the situation in more complex scenerios. I am trying to develop a simple algorithm that will be able to match the above addresses as a key.

For example. the key might be "11TEST" - first two of 110, first two of Test and first two of street variant. A full match key would also include first 5 of the zipcode as well so in the above example, the full key might look like "11TEST44680".

I am looking for ideas for an effective algorithm or resources I can look at for considerations when developing this. Any ideas can be pseudo code or in your language of choice.

We are only concerned with US addresses. In fact, we are only looking at addresses from 250 zip codes from Ohio and Michigan. We also do not have access to any postal software although would be open to ideas for cost effective solutions (it would essentially be a one time use). Please be mindful that this is an initial dump of data from a government source so suggestions of how users can clean it are helpful as I build out the application but I would love to have the best initial I possibly can by being able to match addresses as best as possible.

+1  A: 

In the UK we would use:

  • House Name or Number (where name includes Flat number for apartment blocks)
  • Postcode

You should certainly be using the postcode, but in the US I believe your Zip codes cover very wide areas compared to postcodes in the UK. You would therefore need to use the street and city.

Your example wouldn't differentiate between 11 Test Street, 110 - 119 Test Street, etc.

If your company has access to an address lookup system, I would run all the data through that to get the data back in a consistent format, possibly with address keys that can be used for matching.

JeeBee
Are you familiar with or have experience with any postal software that you mentioned?
sestocker
Not in the US I don't, and if your company doesn't use such a service now it won't be cost effective to arrange that for your task.
JeeBee
All of the postal software I've come across [and I've searched hard] is a). country specific; b). requires field data separated into logical fields properly; c). doesn't account for spelling mistakes and typos; d). won't match if fields have incorrect info, i.e. should've been entered as street instead of road; e). won't attempt interpolation or fuzzy matching.
BenAlabaster
Ah, the Equifax address service I have used in the past doesn't have the problems of (b) (to some extent), (c), (d) or (e) but then again UK postcodes are granular enough to make things simple.
JeeBee
A: 

If you dont chose to use an existing system, one idea is to do the following:

  • Extract numbers from the address line
  • replace common street words with blanks
  • create match string

ie: "555 Canal Street":

  • Extract number gives "555" + "Canal Street"
  • Replace street words gives "555" + "Canal"
  • Create match string gives "555Canal"

"Canal st 555" would give the same match string.

By street words i mean words and abbreviations for "street" in your language, for example "st", "st.", "blv", "ave", "avenue", etc etc all are removed from the string.

By extracting numbers and separating them from the string it does not matter if they are first or last.

Brimstedt
In your example, you'd end up matching 55 Canal Road and 55 Canal Street as being the same address.
Winston Smith
"Cleaning" may prove to be an option. I'd agree with Joe though - 55 Canal Rd and 55 Canal St should not be handled as though they are the same address. Instead of replacing "street" with a space, replace with the abbreviation "st" to get consistent addresses.
sestocker
St causes huge problems when you have to incorporate saints... address cleaning is not as trivial a task as people consider. I'm working on an algorithm at the moment too and there are _many_ subtle hangups
BenAlabaster
The mixing "road" and "street" was intentional from my side, but of course this depends on the requirements. In Sweden where I operate, we would want the equivalent of street and road to be treated the same, because exhanging those is a common way for imposters to trick softwares (and in sweden, things would most likely be delivered correctly anyway - with tehe somewhat faulty address)My intention was solely to give a hint on how she/he could solve the problem and not to provide a bullet proof solution. :-)
Brimstedt
A: 

If I was to take a crack at this I'd convert each address string into a tree using a pre-defined order of operations.

Eg. 110 Test Street Apt 3. Anywhere California 90210 =>

  1. Get the type of address. Eg Street addresses have different formats that rural route addresses and this is different by country.
  2. Given that this is a street address, get the string that represents the type of street and convert that to an enum (eBoulevard, eRoad, etc..)
  3. Given that this is a street address, pull out the street name (store in lower case)
  4. Given that this is a street address, pull out the street number
  5. Given that this is a street address, look for any apartment number (could be before the street number with a dash, could be after "Apt.", etc...)

       eStreet  //1.an enum of possible address types eg. eStreet, eRuralRoute,...
          |
       eStreet        //2.an enum of street types eg. eStreet, eBlvd, eWay,...
       /   |   \
    

    Name Number Apt | | | test 110 3

Eg. RR#3 Anywhere California 90210 =>

  1. Get the type of address: rural route
  2. Given that this is a rural route address, get the route number

       eRuralRoute 
          |
          3
    

You'll need to do something similar for country state and zip information.

Then compare the resulting trees.

This makes the comparison very simple, however, the code to generate the trees is very tricky. You'd want to test the crap out of it on thousands and thousands of addresses. Your problem is simpler if it is only US addresses you care about; British addresses as already mentioned are quite different, and Canadian address may have French in them (eg. Place D'Arms, Rue Laurent, etc...)

Robert Gowland
Just so the everyone knows, we only care about US addresses. We are actually looking at address information for only 250 zip codes on Ohio and Michigan.
sestocker
+2  A: 

I'm working on a similar algorithm as we speak, it should handle addresses in Canada, USA, Mexico and the UK by the time I'm done. The problem I'm facing is that they're in our database in a 3 field plaintext format [whoever thought that was a good idea should be shot IMHO], so trying to handle rural routes, general deliveries, large volume receivers, multiple countries, province vs. state vs. county, postal codes vs. zip codes, spelling mistakes is no small or simple task.

Spelling mistakes alone was no small feat - especially when you get to countries that use French names - matching Saint, Sainte, St, Ste, Saints, Saintes, Sts, Stes, Grand, Grande, Grands, Grandes with or without period or hyphenation to the larger part of a name cause no end of performance issues - especially when St could mean saint or street and may or may not have been entered in the correct context (i.e. feminine vs. masculine). What if the address has largely been entered correctly but has an incorrect province or postal code?

One place to start your search is the Levenstein Distance Algorithm which I've found to be really useful for eliminating a large portion of spelling mistakes. After that, it's mostly a case of searching for keywords and comparing against a postal database.

I would be really interested in collaborating with anyone that is currently developing tools to do this, perhaps we can assist each other to a common solution. I'm already part of the way there and have overcome all the issues I've mentioned so far, having someone else working on the same problem would be really helpful to bounce ideas off.

Cheers - [ben at afsinc dot ca]

BenAlabaster
Thanks for the link John
BenAlabaster
No problem ... just trying to populute every page with my glorious avatar image. ;-)
John MacIntyre
A: 

use an identity for the primary key, this will always be unique and will make it easier to merge duplicates later.

force proper data entry with the user interface. Make them enter each component in its own text box. The house number is entered in own box, the street name in its own box, city in own box, state from select list, etc.. This will make looking for matches easier

have a two process "save"

  • after initial save, do a search to look up matches, present them with list of possible matches as well as the new one.
  • after they select the new one save it, if they pick an existing one use that ID

clean the data. Try to strip out "street", "st", "drive", etc and store it as a StreetType char(1) that uses a FK to a table containing the proper abbreviations, so you can build the street.

look into SOUNDEX and DIFFERENCE

I have worked at large companies that maintain mailinig lists, and they did not attempt to do it automatically, they used people to filter out the new from the dups because it is so hard to do. Plan for a merge feature so you can manually merge duplicates when they occur, and ripple the values through the PKs.

You might look into the google maps api and see if you can pass in you address and get a match back. I'm not familiar with it, this is just speculation.

KM
Very interesting ideas in dealing with the data. This initial dump of data we have no control over (its from the US government actually) but the subsequent use will need the merge and delete operations you mentioned.
sestocker
soundex and difference both have huge performance and integrity issues. Using a key is all good for comparison of correct data, but when information starts getting entered incorrectly, the key is all but useless except for finding the record... which may or may not be correct.
BenAlabaster
you only need to use SOUNDEX and DIFFERENCE on initial entry to help find any matches. I'm sure a little dip in performance would be better than having duplicate addresses in the DB
KM
+1  A: 

If you would prefer tonot develop one and rather use an off-the-shelf product that uses many of the technologies mentioned here, see: http://www.melissadata.com/dqt/matchup-api.htm

Disclaimer: I had a role in its development and work for the company.

Marc Bernier
A: 

If it is cost-effective for your company to write its own address normalization tool then I'd suggest starting with the USPS address standard. Alternatively, there are any number of vendors offering server side tools and web services to normalize, correct and verify addresses.

My company uses AccuMail Gold for this purpose because it does a lot more than just standardize & correct the address. When we considered the cost of even one week's worth of salary to develop a tool in-house the choice to buy an off-the-shelf product was obvious.

Handcraftsman
A: 

The standardize/merge/purge procedures you need to follow are described at http://semaphorecorp.com/mpdd/mpdd.html

joe snyder