ansaurus

Question

Parse usable Street Address, City, State, Zip from a string

Answer 1

A:

This won't solve your problem, but if you only needed lat/long data for these addresses, the Google Maps API will parse non-formatted addresses pretty well.

pix0r 2008-08-19 15:40:15

Answer 2

+3 A:

Are there any standards at all in the way that the addresses are recorded? For example:

Are there always commas or new-lines separating street1 from street2 from city from state from zip?
Are address types (road, street, boulevard, etc) always spelled out? always abbreviated? Some of each?
Define "error".

My general answer is a series of Regular Expressions, though the complexity of this depends on the answer. And if there is no consistency at all, then you may only be able to achieve partial success with a Regex (ie: filtering out zip code and state) and will have to do the rest by hand (or at least go through the rest very carefully to make sure you spot the errors).

Yaakov Ellis 2008-08-19 15:40:18

Answer 3

A:

Couple of questions: 1. Any delimiters? 2. What is the field order in the string? 3. What behavior do you want in the case of a data error (E.G. shove address into a single field in the SQL table, leave others blank)

Jay Mooney 2008-08-19 15:41:21

Answer 4

+29 A:

I've done a lot of work on this kind of parsing. Because there are errors you won't get 100% accuracy, but there are a few things you can do to get most of the way there, and then do a visual BS test. Here's the general way to go about it. It's not code, because it's pretty academic to write it, there's no weirdness, just lots of string handling.

(Now that you've posted some sample data, I've made some minor changes)

Work backward. Start from the zip code, which will be near the end, and in one of two known formats: XXXXX or XXXXX-XXXX. If this doesn't appear, you can assume you're in the city, state portion, below.
The next thing, before the zip, is going to be the state, and it'll be either in a two-letter format, or as words. You know what these will be, too -- there's only 50 of them. Also, you could soundex the words to help compensate for spelling errors.
before that is the city, and it's probably on the same line as the state. You could use a zip-code database to check the city and state based on the zip, or at least use it as a BS detector.
The street address will generally be one or two lines. The second line will generally be the suite number if there is one, but it could also be a PO box.
It's going to be near-impossible to detect a name on the first or second line, though if it's not prefixed with a number (or if it's prefixed with an "attn:" or "attention to:" it could give you a hint as to whether it's a name or an address line.

I hope this helps somewhat.

Tim Sullivan 2008-08-19 15:52:05

Answer 5

A:

There are data services that given a zip code will give you list of street names in that zip code.

Use a regex to extract Zip or City State - find the correct one or if a error get both. pull the list of streets from a data source Correct the city and state, and then street address. Once you get a valid Address line 1, city, state, and zip you can then make assumptions on address line 2..3

AdamSane 2008-08-19 15:52:48

Answer 6

+3 A:

Based on the sample data:

I would start at the end of the string. Parse a Zip-code (either format). Read end to first space. If no Zip Code was found Error.
Trim the end then for spaces and special chars (commas)
Then move on to State, again use the Space as the delimiter. Maybe use a lookup list to validate 2 letter state codes, and full state names. If no valid state found, error.
Trim spaces and commas from the end again.
City gets tricky, I would actually use a comma here, at the risk of getting too much data in the city. Look for the comma, or beginning of the line.
If you still have chars left in the string, shove all of that into an address field.

This isn't perfect, but it should be a pretty good starting point.

Jay Mooney 2008-08-19 15:55:41

Answer 7

+3 A:

Another request for sample data.

As has been mentioned I would work backwards from the zip.

Once you have a zip I would query a zip database, store the results, and remove them & the zip from the string.

That will leave you with the address mess. MOST (All?) addresses will start with a number so find the first occurrence of a number in the remaining string and grab everything from it to the (new) end of the string. That will be your address. Anything to the left of that number is likely an addressee.

You should now have the City, State, & Zip stored in a table and possibly two strings, addressee and address. For the address, check for the existence of "Suite" or "Apt." etc. and split that into two values (address lines 1 & 2).

For the addressee I would punt and grab the last word of that string as the last name and put the rest into the first name field. If you don't want to do that, you'll need to check for salutation (Mr., Ms., Dr., etc.) at the start and make some assumptions based on the number of spaces as to how the name is made up.

I don't think there's any way you can parse with 100% accuracy.

Chuck 2008-08-19 16:03:28

Answer 8

+6 A:

I've done this in the past.

Either do it manually, (build a nice gui that helps the user do it quickly) or have it automated and check against a recent address database (you have to buy that) and manually handle errors.

Manual handling will take about 10 seconds each, meaning you can do 3600/10 = 360 per hour, so 4000 should take you approximately 11-12 hours. This will give you a high rate of accuracy.

For automation, you need a recent US address database, and tweak your rules against that. I suggest not going fancy on the regex (hard to maintain long-term, so many exceptions). Go for 90% match against the database, do the rest manually.

Do get a copy of Postal Addressing Standards (USPS) at http://pe.usps.gov/cpim/ftp/pubs/Pub28/pub28.pdf and notice it is 130+ pages long. Regexes to implement that would be nuts.

For international addresses, all bets are off. US-based workers would not be able to validate.

Alternatively, use a data service. I have, however, no recommendations.

Furthermore: when you do send out the stuff in the mail (that's what it's for, right?) make sure you put "address correction requested" on the envelope (in the right place) and update the database. (We made a simple gui for the front desk person to do that; the person who actually sorts through the mail)

Finally, when you have scrubbed data, look for duplicates.

Christopher Mahan 2008-08-19 16:04:45

Answer 9

A:

You're all overthinking this

a) Split on commas.

b) If the name appears in the first line of the address either :

If the name exists in another field (Likely), remove the name from the first address line (using a replace with Blank).
If the name doesn't appear anywhere else, strip to the first numeric character if the length of that first field is > 40 characters.

You'll end up with some errors, but export the whole lot to Excel and get the customer to tidy it up. Because most addresses will be correct, they'll be able to quickly get the data correctly. i.e. 4000 lines should be sortable in a couple of hours.

seanyboy 2008-08-19 16:36:19

Answer 10

+3 A:

If it's human entered data, then you'll spend too much time trying to code around the exceptions.

Try:

Regular expression to extract the zip code
Zip code lookup (via appropriate government DB) to get the correct address
Get an intern to manually verify the new data matches the old

engtech 2008-08-19 16:47:16

Answer 11

+5 A:

This won't solve your problem, but if you only needed lat/long data for these addresses, the Google Maps API will parse non-formatted addresses pretty well.

Good suggestion, alternatively you can execute a CURL request for each address to Google Maps and it will return the properly formatted address. From that, you can regex to your heart's content.

Kevin 2008-08-19 16:51:08

Answer 12

+5 A:

I've been working in the address processing domain for about 5 years now, and there really is no silver bullet. The correct solution is going to depend on the value of the data. If it's not very valuable, throw it through a parser as the other answers suggest. If it's even somewhat valuable you'll definitely need to have a human evaluate/correct all the results of the parser. If you're looking for a fully automated, repeatable solution, you probably want to talk to a address correction vendor like Group1 or Trillium.

Nicholas Trandem 2008-08-19 17:08:25

Answer 13

+3 A:

After the advice here, I have devised the following function in VB which creates passable, although not always perfect (if a company name and a suite line are given, it combines the suite and city) usable data. Please feel free to comment/refactor/yell at me for breaking one of my own rules, etc.:

    Public Function parseAddress(ByVal input As String) As Collection
      input = input.Replace(",", "")
      input = input.Replace("  ", " ")
      Dim splitString() As String = Split(input)
      Dim streetMarker() As String = New String() {"street", "st", "st.", "avenue", "ave", "ave.", "blvd", "blvd.", "highway", "hwy", "hwy.", "box", "road", "rd", "rd.", "lane", "ln", "ln.", "circle", "circ", "circ.", "court", "ct", "ct."}
      Dim address1 As String
      Dim address2 As String = ""
      Dim city As String
      Dim state As String
      Dim zip As String
      Dim streetMarkerIndex As Integer

      zip = splitString(splitString.Length - 1).ToString()
      state = splitString(splitString.Length - 2).ToString()
      streetMarkerIndex = getLastIndexOf(splitString, streetMarker) + 1
      Dim sb As New StringBuilder

      For counter As Integer = streetMarkerIndex To splitString.Length - 3
        sb.Append(splitString(counter) + " ")
      Next counter
      city = RTrim(sb.ToString())
      Dim addressIndex As Integer = 0

      For counter As Integer = 0 To streetMarkerIndex
        If IsNumeric(splitString(counter)) _
        Or splitString(counter).ToString.ToLower = "po" _
        Or splitString(counter).ToString().ToLower().Replace(".", "") = "po" Then
            addressIndex = counter
            Exit For
        End If
      Next counter

      sb = New StringBuilder
      For counter As Integer = addressIndex To streetMarkerIndex - 1
        sb.Append(splitString(counter) + " ")
      Next counter

      address1 = RTrim(sb.ToString())

      sb = New StringBuilder

      If addressIndex = 0 Then
          If splitString(splitString.Length - 2).ToString() <> splitString(streetMarkerIndex + 1) Then
              For counter As Integer = streetMarkerIndex To splitString.Length - 2
                sb.Append(splitString(counter) + " ")
              Next counter
          End If
      Else
          For counter As Integer = 0 To addressIndex - 1
            sb.Append(splitString(counter) + " ")
          Next counter
      End If
      address2 = RTrim(sb.ToString())

      Dim output As New Collection
      output.Add(address1, "Address1")
      output.Add(address2, "Address2")
      output.Add(city, "City")
      output.Add(state, "State")
      output.Add(zip, "Zip")
      Return output
End Function

Private Function getLastIndexOf(ByVal sArray As String(), ByVal checkArray As String()) As Integer
    Dim sourceIndex As Integer = 0
    Dim outputIndex As Integer = 0
    For Each item As String In checkArray
        For Each source As String In sArray
            If source.ToLower = item.ToLower Then
                outputIndex = sourceIndex
                If item.ToLower = "box" Then
                    outputIndex = outputIndex + 1
                End If
            End If
            sourceIndex = sourceIndex + 1
        Next
        sourceIndex = 0
    Next
    Return outputIndex
End Function

Passing the parseAddress function "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947" returns:

2299 Lewes-Georgetown Hwy

A. P. Croll & Son

Georgetown

DE

19947

Rob Allen 2008-08-19 19:52:34

Answer 14

+31 A:

I think outsourcing the problem is the best bet: send it to the Google (or Yahoo) geocoder. The geocoder returns not only the lat/long (which aren't of interest here), but also a rich parsing of the address, with fields filled in that you didn't send (including ZIP+4 and county).

For example, parsing "1600 Amphitheatre Parkway, Mountain View, CA" yields

{
  "name": "1600 Amphitheatre Parkway, Mountain View, CA, USA",
  "Status": {
    "code": 200,
    "request": "geocode"
  },
  "Placemark": [
    {
      "address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA",
      "AddressDetails": {
        "Country": {
          "CountryNameCode": "US",
          "AdministrativeArea": {
            "AdministrativeAreaName": "CA",
            "SubAdministrativeArea": {
              "SubAdministrativeAreaName": "Santa Clara",
              "Locality": {
                "LocalityName": "Mountain View",
                "Thoroughfare": {
                  "ThoroughfareName": "1600 Amphitheatre Pkwy"
                },
                "PostalCode": {
                  "PostalCodeNumber": "94043"
                }
              }
            }
          }
        },
        "Accuracy": 8
      },
      "Point": {
        "coordinates": [-122.083739, 37.423021, 0]
      }
    }
  ]
}

Now that's parseable!

James A. Rosen 2008-08-19 20:30:52

Since this is a batch process, I'd also suggest using a thread pool to do the geocoding so that you can submit multiple addresses at once (does google support a batch interface of any kind?)

David 2009-05-18 14:17:34

This would not really help with the address line two (pint 5 in the question)

Christopher Mahan 2010-01-02 07:31:44

The terms of service are often a limiting factor here for commercial and/or nonpublic uses.

Jay 2010-03-19 19:53:10

Answer 15

+1 A:

RecogniContact is a Windows COM object that parses US and European addresses. You can try it right on http://www.loquisoft.com/index.php?page=8

2009-05-12 14:07:31

Answer 16

A:

Does anyone know about FINALIST?

Does it really useful for address verification?

Eager to know how it works?

Arun Prasad 2009-12-27 21:19:59

Answer 17

A:

I would LOVE to get Google's code, since it's optimized for the international address standard. Probably have to depend on them for continued service, and the enventual fees.

Dennis 2010-01-02 06:34:01

Answer 18

A:

Try www.address-parser.com. We use their web service, which you can test online

We Know 2010-02-08 14:53:10

Answer 19

A:

Use www.match-logics.com Fuzzy Search Server that is designed specially for such king of problems.

paul 2010-05-11 18:22:00

Answer 20

+2 A:

+1 on James A. Rosen's suggested solution as it has worked well for me, however for completists this site is a fascinating read and the best attempt I've seen in documenting international addresses: http://www.columbia.edu/kermit/postal.html

weston 2010-09-17 20:05:38

Answer 21

A:

I don't know HOW FEASIBLE this would be, but I haven't seen this mentioned so I thought I would go ahead and suggest this:

If you are strictly in the US... get a huge database of all zip codes, states, cities and streets. Now look for these in your addresses. You can validate what you find by testing if, say, the city you found exists in the state you found, or by checking if the street you found exists in the city you found. If not, chances are John isn't for John's street, but is the name of the addressee... Basically, get the most information you can and check your addresses against it. An extreme example would be to get A LIST OF ALL THE ADDRESSES IN THE US OF A and then find which one has the most relevant match to each of your addresses...

Shawn 2010-10-01 02:50:01

ansaurus

tags:

views:

answers:

Parse usable Street Address, City, State, Zip from a string

related questions