We are developing a C# application that imports address data into a CRM system. The CSV file contains an address column with values like 'Somethingstreet 34'. Our CRM, however, uses two separate fields for the street name and the house number. That example poses no problem, of course, but the Dutch addressing system can be a bit of a pain.

Real world examples:

  • Somestreet 88a (where 'Somestreet' is the street name and '88a' is the house number)
  • 2e van Blankenburgstraat 123a (where '2e van Blankenburgstraat' is the street name and '123a' is the house number)
  • 2e van Blankenburgstraat 123-a (where '2e van Blankenburgstraat' is the street name and '123-a' is the house number)
  • 2e van Blankenburgstraat 123 a (where '2e van Blankenburgstraat' is the street name and '123 a' is the house number)

Now I'm looking for a function (a regex or something similar) that splits these address lines correctly into two fields. Is there a clean way to do this?


edit:

I did some further investigation into our addressing system, and it seems (thank you, government) that the examples above are not even the worst ones.

Some more (these are real streets and numbers):

  • Rivium 1e Straat 53/ET6 (where 'Rivium 1e Straat' is the street name and '53/ET6' is the house number)
  • Plein 1940-1945 34 (where 'Plein 1940-1945' is the street name and '34' is the house number)
  • Apollo 11-Laan 11 (where 'Apollo 11-Laan' is the street name and the second '11' is the house number)
  • Charta 77 Vaart 159 3H (where 'Charta 77 Vaart' is the street name and '159 3H' is the house number)
  • Charta 77 Vaart 44/2 (where 'Charta 77 Vaart' is the street name and '44/2' is the house number)
A: 

What I did (though I doubt it is the most performant solution) was to reverse the address and then take everything from the start up to and including the first run of digits, e.g. with the regex ^\D*\d+ on the reversed address. (A greedy .*\d+ would run on to the last digit in the reversed string instead.) This handles street names that contain a digit.
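
A minimal sketch of the reversal idea, under some assumptions: the class and method names (ReversedAddressSplitter, SplitAddress) are hypothetical, and the pattern ^\D*\d+ is my reading of "the first part till you find a digit", anchored so the match stops at the first run of digits in the reversed string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReversedAddressSplitter
{
    // Hypothetical helper: returns (street, house number),
    // or null when the address contains no digit at all.
    public static (string Street, string HouseNumber)? SplitAddress(string address)
    {
        string trimmed = address.Trim();

        // Reverse the address so the house number comes first.
        char[] chars = trimmed.ToCharArray();
        Array.Reverse(chars);
        string reversed = new string(chars);

        // Optional non-digits (a suffix such as "a" or "-a"),
        // followed by the first run of digits.
        Match m = Regex.Match(reversed, @"^\D*\d+");
        if (!m.Success) return null;

        // The match length is the length of the house number
        // at the end of the original string.
        int numberLength = m.Length;
        string houseNumber = trimmed.Substring(trimmed.Length - numberLength);
        string street = trimmed.Substring(0, trimmed.Length - numberLength).TrimEnd();
        return (street, houseNumber);
    }
}
```

With this sketch, 'Plein 1940-1945 34' and 'Apollo 11-Laan 11' split as the question expects, but house numbers containing a second group of digits ('53/ET6', '159 3H', '44/2') are cut too short, so those would still need separate handling or manual review.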

Ruben
A: 

Could you split on spaces and then check whether the first character of some interior token is a digit?

Something like:

 // Split the line into space-separated tokens.
 string[] split = addressLine.Split(' ');
 int splitLoc = -1;
 for (int i = 1; i < split.Length; i++) { // start at 1 to skip leading tokens like '2e'
     // The first token that starts with a digit marks the house number.
     if (split[i].Length > 0 && char.IsDigit(split[i][0])) {
         splitLoc = i;
         break;
     }
 }
 if (splitLoc < 0) return; // busted: no house number found

 // Everything before the split is the street name, the rest is the house number.
 string field1 = string.Join(" ", split, 0, splitLoc);
 string field2 = string.Join(" ", split, splitLoc, split.Length - splitLoc);

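Whether this counts as 'clean' depends on what you mean by it, but it looks like it would work as long as all addresses are formed the way you specified. It will split too early on street names with an interior token that starts with a digit, such as 'Charta 77 Vaart'.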

mmr
A: 

The best solution for data correctness would be to compare the existing database against a known address API that has a function to do this for you. Otherwise you are only making your best guess, and some, if not all, of the data should be manually reviewed.

Greg
A: 

There are too many different ways someone could enter this data. I often write my address as:

123 Foo Street Apt#3

i.e. with the house number and the apartment number at opposite ends of the street name.

If this were my problem, I would write a regex that handles the "easy" ones and flags the complicated ones for human review.
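
A rough sketch of that "easy ones only" approach, under stated assumptions: the class name EasyAddressSplitter and the definition of "easy" are mine, not part of the question. Here "easy" means an optional leading ordinal such as '2e', a street name with no other digits, and a house number made of digits plus at most one letter (optionally separated by a space or hyphen). Anything else returns false and should go to a person.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EasyAddressSplitter
{
    // Assumed "easy" shape: [ordinal like "2e"] street-without-digits, then
    // digits with an optional single-letter suffix ("88a", "123-a", "123 a").
    private static readonly Regex Easy = new Regex(
        @"^(?<street>(?:\d+e\s+)?\D+?)\s+(?<number>\d+(?:[ -]?[A-Za-z])?)$");

    // Returns true and fills both fields for an easy address;
    // returns false when the address should be flagged for human review.
    public static bool TrySplit(string address, out string street, out string houseNumber)
    {
        Match m = Easy.Match(address.Trim());
        street = m.Success ? m.Groups["street"].Value : null;
        houseNumber = m.Success ? m.Groups["number"].Value : null;
        return m.Success;
    }
}
```

The pattern is deliberately conservative: 'Somestreet 88a' and '2e van Blankenburgstraat 123 a' are split, while 'Plein 1940-1945 34', 'Charta 77 Vaart 159 3H' and '123 Foo Street Apt#3' are all flagged rather than guessed at.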

You can find a list of street names in the US from the Census Bureau, but it is buried inside a monster data file.

Autodidact