ansaurus

Question

Answer 1

+1 A:

In the UK we would use:

House Name or Number (where name includes Flat number for apartment blocks)
Postcode

You should certainly be using the postcode, but in the US I believe your Zip codes cover very wide areas compared to postcodes in the UK. You would therefore need to use the street and city.

Your example wouldn't differentiate between 11 Test Street, 110 - 119 Test Street, etc.

If your company has access to an address lookup system, I would run all the data through that to get the data back in a consistent format, possibly with address keys that can be used for matching.

JeeBee 2009-05-05 12:22:04

Are you familiar with or have experience with any postal software that you mentioned?

sestocker 2009-05-05 12:56:24

Not in the US I don't, and if your company doesn't use such a service now it won't be cost effective to arrange that for your task.

JeeBee 2009-05-05 13:00:17

All of the postal software I've come across [and I've searched hard] is a). country specific; b). requires field data separated into logical fields properly; c). doesn't account for spelling mistakes and typos; d). won't match if fields have incorrect info, i.e. should've been entered as street instead of road; e). won't attempt interpolation or fuzzy matching.

BenAlabaster 2009-05-05 14:25:48

Ah, the Equifax address service I have used in the past doesn't have the problems of (b) (to some extent), (c), (d) or (e) but then again UK postcodes are granular enough to make things simple.

JeeBee 2009-05-05 15:41:23

Answer 2

A:

If you dont chose to use an existing system, one idea is to do the following:

Extract numbers from the address line
replace common street words with blanks
create match string

ie: "555 Canal Street":

Extract number gives "555" + "Canal Street"
Replace street words gives "555" + "Canal"
Create match string gives "555Canal"

"Canal st 555" would give the same match string.

By street words i mean words and abbreviations for "street" in your language, for example "st", "st.", "blv", "ave", "avenue", etc etc all are removed from the string.

By extracting numbers and separating them from the string it does not matter if they are first or last.

Brimstedt 2009-05-05 12:29:35

In your example, you'd end up matching 55 Canal Road and 55 Canal Street as being the same address.

Winston Smith 2009-05-05 12:36:18

"Cleaning" may prove to be an option. I'd agree with Joe though - 55 Canal Rd and 55 Canal St should not be handled as though they are the same address. Instead of replacing "street" with a space, replace with the abbreviation "st" to get consistent addresses.

sestocker 2009-05-05 12:40:09

St causes huge problems when you have to incorporate saints... address cleaning is not as trivial a task as people consider. I'm working on an algorithm at the moment too and there are _many_ subtle hangups

BenAlabaster 2009-05-05 14:08:57

The mixing "road" and "street" was intentional from my side, but of course this depends on the requirements. In Sweden where I operate, we would want the equivalent of street and road to be treated the same, because exhanging those is a common way for imposters to trick softwares (and in sweden, things would most likely be delivered correctly anyway - with tehe somewhat faulty address)My intention was solely to give a hint on how she/he could solve the problem and not to provide a bullet proof solution. :-)

Brimstedt 2009-05-08 06:14:02

Answer 3

A:

If I was to take a crack at this I'd convert each address string into a tree using a pre-defined order of operations.

Eg. 110 Test Street Apt 3. Anywhere California 90210 =>

Get the type of address. Eg Street addresses have different formats that rural route addresses and this is different by country.
Given that this is a street address, get the string that represents the type of street and convert that to an enum (eBoulevard, eRoad, etc..)
Given that this is a street address, pull out the street name (store in lower case)
Given that this is a street address, pull out the street number
Given that this is a street address, look for any apartment number (could be before the street number with a dash, could be after "Apt.", etc...)
```
   eStreet  //1.an enum of possible address types eg. eStreet, eRuralRoute,...
      |
   eStreet        //2.an enum of street types eg. eStreet, eBlvd, eWay,...
   /   |   \
```
Name Number Apt | | | test 110 3

Eg. RR#3 Anywhere California 90210 =>

Get the type of address: rural route
Given that this is a rural route address, get the route number
```
   eRuralRoute 
      |
      3
```

You'll need to do something similar for country state and zip information.

Then compare the resulting trees.

This makes the comparison very simple, however, the code to generate the trees is very tricky. You'd want to test the crap out of it on thousands and thousands of addresses. Your problem is simpler if it is only US addresses you care about; British addresses as already mentioned are quite different, and Canadian address may have French in them (eg. Place D'Arms, Rue Laurent, etc...)

Robert Gowland 2009-05-05 13:45:08

Just so the everyone knows, we only care about US addresses. We are actually looking at address information for only 250 zip codes on Ohio and Michigan.

sestocker 2009-05-05 13:57:37

Answer 4

+2 A:

I'm working on a similar algorithm as we speak, it should handle addresses in Canada, USA, Mexico and the UK by the time I'm done. The problem I'm facing is that they're in our database in a 3 field plaintext format [whoever thought that was a good idea should be shot IMHO], so trying to handle rural routes, general deliveries, large volume receivers, multiple countries, province vs. state vs. county, postal codes vs. zip codes, spelling mistakes is no small or simple task.

Spelling mistakes alone was no small feat - especially when you get to countries that use French names - matching Saint, Sainte, St, Ste, Saints, Saintes, Sts, Stes, Grand, Grande, Grands, Grandes with or without period or hyphenation to the larger part of a name cause no end of performance issues - especially when St could mean saint or street and may or may not have been entered in the correct context (i.e. feminine vs. masculine). What if the address has largely been entered correctly but has an incorrect province or postal code?

One place to start your search is the Levenstein Distance Algorithm which I've found to be really useful for eliminating a large portion of spelling mistakes. After that, it's mostly a case of searching for keywords and comparing against a postal database.

I would be really interested in collaborating with anyone that is currently developing tools to do this, perhaps we can assist each other to a common solution. I'm already part of the way there and have overcome all the issues I've mentioned so far, having someone else working on the same problem would be really helpful to bounce ideas off.

Cheers - [ben at afsinc dot ca]

BenAlabaster 2009-05-05 13:52:57

Thanks for the link John

BenAlabaster 2009-05-05 14:12:59

No problem ... just trying to populute every page with my glorious avatar image. ;-)

John MacIntyre 2009-05-05 14:29:46

Answer 5

A:

use an identity for the primary key, this will always be unique and will make it easier to merge duplicates later.

force proper data entry with the user interface. Make them enter each component in its own text box. The house number is entered in own box, the street name in its own box, city in own box, state from select list, etc.. This will make looking for matches easier

have a two process "save"

after initial save, do a search to look up matches, present them with list of possible matches as well as the new one.
after they select the new one save it, if they pick an existing one use that ID

clean the data. Try to strip out "street", "st", "drive", etc and store it as a StreetType char(1) that uses a FK to a table containing the proper abbreviations, so you can build the street.

look into SOUNDEX and DIFFERENCE

I have worked at large companies that maintain mailinig lists, and they did not attempt to do it automatically, they used people to filter out the new from the dups because it is so hard to do. Plan for a merge feature so you can manually merge duplicates when they occur, and ripple the values through the PKs.

You might look into the google maps api and see if you can pass in you address and get a match back. I'm not familiar with it, this is just speculation.

KM 2009-05-05 13:56:11

Very interesting ideas in dealing with the data. This initial dump of data we have no control over (its from the US government actually) but the subsequent use will need the merge and delete operations you mentioned.

sestocker 2009-05-05 14:01:49

soundex and difference both have huge performance and integrity issues. Using a key is all good for comparison of correct data, but when information starts getting entered incorrectly, the key is all but useless except for finding the record... which may or may not be correct.

BenAlabaster 2009-05-05 14:27:05

you only need to use SOUNDEX and DIFFERENCE on initial entry to help find any matches. I'm sure a little dip in performance would be better than having duplicate addresses in the DB

KM 2009-05-05 20:07:33

Answer 6

+1 A:

If you would prefer tonot develop one and rather use an off-the-shelf product that uses many of the technologies mentioned here, see: http://www.melissadata.com/dqt/matchup-api.htm

Disclaimer: I had a role in its development and work for the company.

Marc Bernier 2009-05-05 15:47:27

Answer 7

A:

If it is cost-effective for your company to write its own address normalization tool then I'd suggest starting with the USPS address standard. Alternatively, there are any number of vendors offering server side tools and web services to normalize, correct and verify addresses.

My company uses AccuMail Gold for this purpose because it does a lot more than just standardize & correct the address. When we considered the cost of even one week's worth of salary to develop a tool in-house the choice to buy an off-the-shelf product was obvious.

Handcraftsman 2010-05-21 14:55:50

Answer 8

A:

The standardize/merge/purge procedures you need to follow are described at http://semaphorecorp.com/mpdd/mpdd.html

joe snyder 2010-06-08 16:47:55

ansaurus

tags:

views:

answers:

Address Match Key Algorithm

related questions