I have an address class that uses a regular expression to parse the house number, street name, and street type from the first line of an address. This code is generally working well, but I'm posting here to share with the community and to see if anyone has suggestions for improvement.

Note: The STREETTYPES and QUADRANTS constants contain all of the relevant street types and quadrants, respectively.

I've included a subset here:

private const string STREETTYPES = @"ALLEY|ALY|ANNEX|AX|ARCADE|ARC|AVENUE|AV|AVE|BAYOU|BYU|BEACH|...";

private const string QUADRANTS = "N|NORTH|S|SOUTH|E|EAST|W|WEST|NE|NORTHEAST|NW|NORTHWEST|SE|SOUTHEAST|SW|SOUTHWEST";

HouseNumber, Quadrant, StreetName, and StreetType are all properties on the class.

    private void Parse(string line1)
    {
        HouseNumber = string.Empty;
        Quadrant = string.Empty;
        StreetName = string.Empty;
        StreetType = string.Empty;

        if (!String.IsNullOrEmpty(line1))
        {
            string noPeriodsLine1 = String.Copy(line1);
            noPeriodsLine1 = noPeriodsLine1.Replace(".", "");

            string addressParseRegEx =
                @"(?ix)
            ^
            \s*
            (?:
               (?<housenumber>\d+)
               (?:(?:\s+|-)(?<quadrant>" +
                QUADRANTS +
                @"))?
               (?:(?:\s+|-)(?<streetname>\S+(?:\s+\S+)*?))??
               (?:(?:\s+|-)(?<quadrant>" +
                QUADRANTS + @"))?
               (?:(?:\s+|-)(?<streettype>" + STREETTYPES +
                @"))?
               (?:(?:\s+|-)(?<streettypequalifier>(?!(?:" +
                QUADRANTS +
                @"))(?:\d+|\S+)))?
               (?:(?:\s+|-)(?<streettypequadrant>(" +
                QUADRANTS + @")))??
               (?:(?:\s+|-)(?<suffix>(?:ste|suite|po\sbox|apt)\s*\S*))?
            |
               (?:(?:po|postoffice|post\s+office)\s+box\s+(?<postofficebox>\S+))
            )
            \s*
            $
            ";
            Match match = Regex.Match(noPeriodsLine1, addressParseRegEx);
            if (match.Success)
            {
                HouseNumber = match.Groups["housenumber"].Value;
                Quadrant = (string.IsNullOrEmpty(match.Groups["quadrant"].Value)) ? match.Groups["streettypequadrant"].Value : match.Groups["quadrant"].Value;
                if (match.Groups["streetname"].Captures.Count > 1)
                {
                    foreach (Capture capture in match.Groups["streetname"].Captures)
                    {
                        StreetName += capture.Value + " ";
                    }
                    StreetName = StreetName.Trim();
                }
                else
                {
                    StreetName = (string.IsNullOrEmpty(match.Groups["streetname"].Value)) ? match.Groups["streettypequalifier"].Value : match.Groups["streetname"].Value;
                }
                StreetType = match.Groups["streettype"].Value;

                //if the matched street type is found
                //use the abbreviated version...especially for credit bureau calls
                string streetTypeAbbreviation;
                if (StreetTypes.TryGetValue(StreetType.ToUpper(), out streetTypeAbbreviation))
                {
                    StreetType = streetTypeAbbreviation;
                }
            }
        }

    }
+4  A: 

I don't know what country you're in, but if you're in the USA and want to spend some money on address validation, you can buy related USPS products here. And here is a good place to find free word lists from the USPS for expected words and abbreviations. I'm sure similar pages are available for other countries.

Adrian Archer
Forgot to include that one stipulation... has to be free. :)
Matt Ruwe
Also, yes, this is only for US addresses
Matt Ruwe
+6  A: 

I think you should clarify your usage scenario.

Unless you're in a very, very limited scenario where you know that the addresses were entered following a strict schema, parsing addresses for content is an extremely hard problem to solve and, usually, quite futile (unless it's the raison d'être of your application).

If you're limited to a particular country that has very specific conventions for writing addresses, then using these regexes might get you 90% of the way.
However, as soon as you have to start accepting foreign addresses, you're screwed.
Even if you're a US-centric site, there is a good chance that you will have to accept addresses from US citizens living abroad, for instance.

Again, it may be OK in a very narrow field, but it's almost always a bad idea to validate or split addresses that were not strictly validated and constrained at the time the user entered them.
Even when you do enforce strict rules for users to enter their addresses, these end up being inadequate in a small portion of cases, even in the best address validation components out there.

Just a few things that mess up address parsing:

  • postal codes (Zip codes) may appear before or after other address elements, or may not exist at all.
  • postal codes follow strict rules: a 10-digit Zip code is probably easy to spot as invalid, but what about a non-existent one? What about more complex codes, such as those used in the UK?
  • What about a place like Hong Kong where you could write the address in either English, Traditional Chinese or Mandarin?
  • What if it's perfectly fine to split your address and write it out of sequence?
  • even if you're just parsing US addresses, there are at least a handful of ways to describe a PO box; you can also use poste restante or general delivery, and then you need to add a 4-digit code to the Zip code, which would normally not be present at all...

Bottom line is

If getting addresses in a parseable format is really important, be 100% sure that you can get all possible combinations right, or you're going to have a percentage of failures that will mean frustrated users and lost sales.
If you don't have 100% case coverage then don't enforce strict rules on the user.
I can't count the number of websites I gave up purchasing from because they would require a Zip/Postal Code when the place I live in has none.

Sorry for the rant, but I think it's important that people wanting to do address validation and parsing think hard about what they're getting themselves into.

Renaud Bompuis
Just a few notes: This application is only for US citizens. The law prohibits the company from working with foreign entities, so that shouldn't be a problem. Also, this is only for parsing the first line of the address (e.g. 12345 Main St). I'm not concerned with the city, state, or zip code.
Matt Ruwe
A: 

I tried to get this to work, but it seems you rely on a static StreetTypes member that is not included. The rest seems to work, but I cannot do much testing without it.
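For anyone else trying to run the posted code, here is a minimal sketch of what the missing member might look like. The question only shows that StreetTypes supports TryGetValue with an upper-cased key, so the dictionary shape, the StreetTypeLookup class name, and the BuildAlternation helper are all assumptions on my part. The handful of abbreviations shown follow USPS Publication 28; the real table would be much longer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StreetTypeLookup
{
    // Hypothetical stand-in for the StreetTypes member that Parse calls.
    // Keys are street types as they may appear in input; values are the
    // abbreviation to store. Only a small subset is listed here.
    public static readonly Dictionary<string, string> StreetTypes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "ALLEY", "ALY" }, { "ALY", "ALY" },
            { "AVENUE", "AVE" }, { "AV", "AVE" }, { "AVE", "AVE" },
            { "BAYOU", "BYU" }, { "BYU", "BYU" },
            { "BEACH", "BCH" }, { "BCH", "BCH" },
        };

    // Hypothetical helper: builds the alternation for the regular expression
    // from the same table, longest keys first, so the pattern and the lookup
    // cannot drift apart.
    public static string BuildAlternation()
    {
        return string.Join("|", StreetTypes.Keys
            .OrderByDescending(k => k.Length)
            .ThenBy(k => k, StringComparer.Ordinal));
    }
}
```

With something like this in place, the TryGetValue call at the end of Parse compiles, and the STREETTYPES string could be generated from the table instead of being maintained by hand.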

There's a STREETTYPES constant defined in the original question. Use that.
Matt Ruwe
+3  A: 

Have fun with addresses and regexes; you're in for a long, horrible ride.

You're trying to lay order upon chaos.

For every "123 Simple Way", there's a "14 1/2 South".

Then, for extra laughs, there's Salt Lake City: "855 South 1300 East".

Have fun with that.

There are more exceptions than rules when it comes to street addresses.

Will Hartung
+1  A: 

I'll agree that your strictness is going to be a problem. I'm writing an address parser designed to strip addresses from classified ads where the format could be just about anything. For instance, for your quadrant matches, you're ignoring punctuation altogether. I have to search data that could represent NE in all these different ways:

"NE", "N.E", "N E", "N.E.", "N. E", "North East", "Northeast"

so I am using the following pattern match which should catch all direction qualifiers no matter how they are expressed:

\b(?:(?:[nesw]\.? ?){0,2}|(?:north|no\.|east|south|so\.|west){0,2})\b

Of course, context is also important since "no" is going to be matched by this. But "NE" for Nebraska would be matched by either, so you really have to be careful about what's to the left and right in your larger expression. I'm having to compile lists of words that commonly appear interspersed in address texts which are not address components, such as "near, x-street, in, across", etc.
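Two things to watch with the pattern as written: the {0,2} quantifiers allow it to match an empty string, and the spelled-out branch has no room for the space in "North East". Below is a rough sketch of a tighter variant in C#. The DirectionMatcher class and FirstDirection method are names made up for illustration, and I have deliberately left out "no." and "so." because of the ambiguity mentioned above. It still matches "NE" for Nebraska, so the context problem remains.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DirectionMatcher
{
    // Either a north/south word optionally followed by an east/west word,
    // or an east/west word on its own. Each word may be spelled out or a
    // single letter, may carry a trailing period, and a compound may have
    // one whitespace character between its two parts.
    private static readonly Regex Direction = new Regex(
        @"(?<!\w)(?:" +
        @"(?<ns>north|south|[ns])\.?(?:\s?(?<ew>east|west|[ew])\.?)?" +
        @"|(?<ew>east|west|[ew])\.?" +
        @")(?!\w)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the first direction found in the text, normalized to one or
    // two upper-case letters (for example "NE"), or null if there is none.
    public static string FirstDirection(string text)
    {
        Match m = Direction.Match(text);
        if (!m.Success)
        {
            return null;
        }
        string ns = m.Groups["ns"].Success ? m.Groups["ns"].Value.Substring(0, 1) : "";
        string ew = m.Groups["ew"].Success ? m.Groups["ew"].Value.Substring(0, 1) : "";
        return (ns + ew).ToUpperInvariant();
    }
}
```

All seven spellings listed above ("NE", "N.E", "N E", "N.E.", "N. E", "North East", "Northeast") should come back as "NE", while a word such as "Nebraska" or "Weston" is rejected because the match has to end at a word boundary.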

It is a very tough problem, and I agree Salt Lake City is a bitch. In addition to having the double direction/coordinate format, they also compound it by referring to stuff like "3700 North 5300 East Arborville Way" where the streets can be referenced by name, number, or both.

Victorb
Remember, this algorithm is only for matching the address portion of the overall address (e.g. 123 Simple Way)... I'm not concerned with city, state, or zip code.
Matt Ruwe
+2  A: 

This actually works pretty well, except that it doesn't pull apartment numbers. We're working on that. It also coughed a little when we had an address of 769 Branch Ave. Of course, "branch" is one of the street types that it's looking for. It all goes back to that making-order-out-of-chaos thing. We know that it's going to break here and there.
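One way to sidestep the "Branch Ave" case is to let the last street-type word win and keep any earlier type-like word in the street name. The sketch below is not the regex from the question. It is a simplified token-based illustration that assumes the first token is the house number and ignores quadrants and suite or apartment suffixes. The LastTypeWins class, its Split method, and the short type list are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LastTypeWins
{
    // Hypothetical subset of street types; the real list is much longer.
    private static readonly HashSet<string> Types = new HashSet<string>(
        new[] { "AVE", "AVENUE", "BRANCH", "ST", "STREET", "WAY" },
        StringComparer.OrdinalIgnoreCase);

    // Returns { houseNumber, streetName, streetType } for the first address
    // line. The street type is the LAST token found in Types, and a type
    // token directly after the house number is treated as the street name.
    public static string[] Split(string line1)
    {
        string[] tokens = line1.Replace(".", "").Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length == 0)
        {
            return new[] { "", "", "" };
        }

        // Scan from the right, stopping before index 1 so that the token
        // right after the house number can never be taken as the type.
        int typeIndex = -1;
        for (int i = tokens.Length - 1; i >= 2; i--)
        {
            if (Types.Contains(tokens[i]))
            {
                typeIndex = i;
                break;
            }
        }

        int nameEnd = typeIndex >= 0 ? typeIndex : tokens.Length;
        string name = nameEnd > 1 ? string.Join(" ", tokens, 1, nameEnd - 1) : "";
        string type = typeIndex >= 0 ? tokens[typeIndex] : "";
        return new[] { tokens[0], name, type };
    }
}
```

For "769 Branch Ave" this gives the number "769", the name "Branch", and the type "Ave". Anything after the chosen type token is simply dropped in this sketch, so it would need more work before it could replace the regex.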

Curtis Maurand
Agreed... It would be impossible to make something bulletproof.
Matt Ruwe