views:

750

answers:

11

How do I find duplicate addresses in a database, or better stop people already when filling in the form ? I guess the earlier the better?

Is there any good way of abstracting street, postal code etc so that typos and simple attempts to get 2 registrations can be detected? like:

Quellenstrasse 66/11 
Quellenstr. 66a-11

I'm talking German addresses... Thanks!

A: 

Often you use constraints in a database to ensure data to be "unique" in the data-based sense.

Regarding "isomorphisms" I think you are on your own, ie writing the code your self. If in the database you could use a trigger.

svrist
A: 

@svrist: thanks for your answer. I can't imangine being on my own with this. Surely I'm not the only developer trying to prevent duplicate addresses...?

Johannes
+3  A: 

The earlier you can stop people, the easier it'll be in the long run!

Not being too familiar with your db schema or data entry form, I'd suggest a route something like the following:

  • have distinct fields in your db for each address "part", e.g. street, city, postal code, Länder, etc.

  • have your data entry form broken down similarly, e.g. street, city, etc

The reasoning behind the above is that each part will likely have it's own particular "rules" for checking slightly-changed addressed, ("Quellenstrasse"->"Quellenstr.", "66/11"->"66a-11" above) so your validation code can check if the values as presented for each field exist in their respective db field. If not, you can have a class that applies the transformation rules for each given field (e.g. "strasse" stemmed to "str") and checks again for duplicates.

Obviously the above method has it's drawbacks:

  • it can be slow, depending on your data set, leaving the user waiting

  • users may try to get around it by putting address "Parts" in the wrong fields (appending post code to city, etc). but from experience we've found that introducing even simple checking like the above will prevent a large percentage of users from entering pre-existing addresses.

Once you've the basic checking in place, you can look at optimising the db accesses required, refining the rules, etc to meet your particular schema. You might also take a look at MySQL's match() function for working out similar text.

ConroyP
+6  A: 

You could use the Google GeoCode API

Wich in fact gives results for both of your examples, just tried it. That way you get structured results that you can save in your database. If the lookup fails, ask the user to write the address in another way.

Espo
A: 

@Espo: thanks interesting idea, I'll definitly look at this. I might use it for another feature as well. Have to list addresses of companies close to the one enter by the user.

Johannes
A: 

@PConroy: This was my initial thougt also. the interesting part on this is to find good transformation rules for the different parts of the address! Any good suggestions?

Johannes
+1  A: 

Before you start searching for duplicate addresses in your database, you should first make sure you store the addresses in a standard format.

Most countries have a standard way of formatting addresses, in the US it's the USPS CASS system: http://www.usps.com/ncsc/addressservices/certprograms/cass.htm

But most other countries have a similar service/standard. Try this site for more international formats: http://bitboost.com/ref/international-address-formats.html

This not only helps in finding duplicates, but also saves you money when mailing you customers (the postal service charges less if the address is in a standard format).

Depending on your application, in some cases you might want to store a "vanity" address record as well as the standard address record. This keeps your VIP customers happy. A "vanity" address might be something like:

62 West Ninety First Street
Apartment 4D
Manhattan, New York, NY 10001

While the standard address might look like this:

62 W 91ST ST APT 4D
NEW YORK NY 10024-1414

urini
+1  A: 

One thing you might want to look at are Soundex searches, which are quite useful for misspellings and contractions.

This however is not an in-database validation so it may or may not be what you're looking for.

Jon Limjap
+1  A: 

Johannes:

@PConroy: This was my initial thougt also. the interesting part on this is to find good transformation rules for the different parts of the address! Any good suggestions?

When we were working on this type of project before, our approach was to take our existing corpus of addresses (150k or so), then apply the most common transformations for our domain (Ireland, so "Dr"->"Drive", "Rd"->"Road", etc). I'm afraid there was no comprehensive online resource for such things at the time, so we ended up basically coming up with a list ourselves, checking things like the phone book (pressed for space there, addresses are abbreviated in all manner of ways!). As I mentioned earlier, you'd be amazed how many "duplicates" you'll detect with the addition of only a few common rules!

I've recently stumbled across a page with a fairly comprehensive list of address abbreviations, although it's american english, so I'm not sure how useful it'd be in Germany! A quick google turned up a couple of sites, but they seemed like spammy newsletter sign-up traps. Although that was me googling in english, so you may have more look with "german address abbreviations" in german :)

ConroyP
+1  A: 

To add an answer to my own question:

A different way of doing it is ask users for their mobile phone number, send them a text msg for verification. This stops most people messing with duplicate addresses.

I'm talking from personal experience. (thanks pigsback !) They introduced confirmation through mobile phone. That stopped me having 2 accounts! :-)

Johannes
+1  A: 

Another possible solution (assuming you actually need reliable address data and you're not just using addresses as a way to prevent duplicate accounts) is to use a third-party web service to standardize the addresses provided by your users.

It works this way -- your system accepts a user's address via an online form. Your form hands off the user's address to the third-party address standardization web service. The web service gives you back the same address but now with the data standardized into discrete address fields, and with the standard abbreviations and formats applied. Your application displays this standardized address to your user for their confirmation before attempting to save the data in your DB.

If all the user addresses go through a standardization step and only standardized addresses are saved to your DB, then finding duplicate records should be greatly simplified since you are now comparing apples to apples.

One such third-party service is Global Address's Interactive Service which includes Germany in the list of supported countries, and also has an online demo that demonstrates how their service works (demo link can be found on that web page).

There's a cost disadvantage to this approach, obviously. However, on the plus side:

  1. you would not need to create and maintain your own address standardization metadata
  2. you won't need to continuously enhance your address standardization routines, and
  3. you're free to focus your software development energy on the parts of the application that are unique to your requirements

Disclaimer: I don't work for Global Address and have not tried using their service. I'm merely mentioning them as an example since they have an online demo that you can actually play with.

mdy