views:

261

answers:

6

I'm trying to avoid reinventing the wheel when it comes to storing street addresses in a table only once. Uniqueness constraints won't work in some common situations:

100 W 5th Ave
100 West 5th Ave
100 W 5th

200 N 6th Ave Suite 405
200 N 6th Ave  #405

I could implement some business logic or a trigger to normalize all fields before inserting and use uniqueness constraints across several fields in the table, but it would be easy to miss some cases with something that varies as much as street addresses.

What would be best would be a universal identifier for each address, perhaps based on GPS coordinates. Before storing a new address look up its GUID and see if the GUID already exists in the Address table.

An organization like Mapquest, the Postal Serice, FedEx, or the US government probably has a system like this.

Has anyone found a good solution to this?

Here's my Address table now (generated by JPA):

CREATE TABLE address
(
  id bigint NOT NULL,
  "number" character varying(255),
  dir character varying(255),
  street character varying(255),
  "type" character varying(255),
  trailingdir character varying(255),
  unit character varying(255),
  city character varying(255),
  state character varying(255),
  zip integer,
  zip4 integer,
  CONSTRAINT address_pkey PRIMARY KEY (id)
)
+3  A: 

Look up the address in Google maps and use the spelling they use.

ammoQ
A: 

You need support for regular expressions like syntax. You can come up with some kind of automata function that will parse tokens and try matching them and then expand or contract them into abbreviations. I'd look into glob() like functions that give support to *? etc on unix as a quick dirty fix.

Dmitry
A: 

This guy gives a full implementation of the code to validate an address with USPS online.

Ron Harlev
And is there any approach for the rest of the world *NOT* living in the US?
marc_s
A: 

I wasn't looking for address validation or normalization, although address validation is a good idea. I need a unique identifier for each street address to avoid duplicate records.

It looks like geocoding can provide a solution. With geocoding the input can be a street address and the output will be latitude and longitude coordinates with enough precision to resolve a specific building.

There's a more serious problem with street address ambiguity than I thought. This is from the Wikipedia page on geocoding:

"...there are multiple 100 Washington Streets in Boston, Massachusetts because several cities have been annexed without changing street names."

The Wikipedia page on geocoding has a list of resources (many free) to perform geocoding.

Dean Schulze
A: 

I settled on the USC WebGIS service due to their nice web service interface and being easy to sign up for.

Geocodes aren't suitable as a unique key for street addresses, though, for a number of reasons. For example, geocoding cannot distinguish between different units in a condominium complex or apartment building.

I decided to use the parsed address from the geocoding result and put unique constraints on the street number, street name, unit, city, state, and zip. It's not perfect, but it works for what I'm doing.

Dean Schulze
+2  A: 

First, carefully reconsider why you feel compelled to store addresses only once and identify them by a unique ID. It adds complexity, fights the changing nature and wide variety of addresses, and may not actually correctly deal with the real problem you're actually trying to solve. See http://semaphorecorp.com/mpdd/mpdd.html regarding duplicate address issues.

The following creates a 35-character identifier that uniquely identifies U.S. addresses THAT RECEIVE MAIL:

House or PO box number [USPS 10 characters maximum]
Optional unit abbreviation [APT/STE/etc, USPS 4 characters maximum]
Optional apartment number [USPS 8 characters maximum]
ZIP code [USPS 5 characters]
+4 code [USPS 4 characters]
MMYR [month year, 4 characters]

The unit abbreviation is required to distinguish (rare) cases like STE 1 and APT 1 in the same building (see Publication 28 at usps.com for the list of all unit types). +4 codes are unavailable for addresses (typically rural) that don't receive mail delivery (eg, they use a PO box at the post office), so you can't create an ID for those addresses. If you wish to distinguish private mail box (PMB) drops at places like a UPS store, you'll have to add the PMB number, but since PMBs are controlled by the stores not the Postal Service the number of characters necessary is unpredictable (although 4 or 5 chars should be enough). The identifier will be guaranteed unique only within the given USPS ZIP+4 database month/year edition, since the same address might have a different +4 code or other component in a different month.

joe snyder