I have to create a loop and, using a regexp, populate any of these four variables:

$address, $street, $town, $lot

The loop will be fed a string that may look like any of the lines below:

  • '123 any street, mytown' or
  • 'Lot 4 another road, thattown' or
  • 'Lot 2 96 other road, her town' or
  • 'this ave, this town' or
  • 'yourtown'

Since anything after a comma is the $town, I thought of starting with

(.*), (.*)

The first capture could then be checked with (Lot \d*) (.*), (.*). If the first capture starts with a number, then it's the $address; if it's words separated by white space, it's the $street; and if the whole string is one word, it's just the $town.

+7  A: 

I'd suggest you don't try to do all of this in a single regexp as it will be hard to verify its correctness.

First, I'd split at the comma. Whatever comes after the comma is the $town, and if there is no comma, the whole string is the $town.

Then I'd check if there is any lot information and extract it from the string.

Then I'd look for street/avenue number and name.

Divide and conquer :)
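
The steps above can be sketched roughly as follows. This is only an illustration in Python, not the answerer's own code: the function name `parse_address` and the key name `number` (standing in for the question's $address) are made up here, and it assumes the town follows the last comma and that a lot always looks like "Lot" followed by digits.

```python
import re


def parse_address(text):
    """Split a free-form address into lot, number, street and town."""
    lot = number = ""

    # Step 1: whatever follows the last comma is the town.
    # With no comma, rpartition returns ('', '', text), so the
    # whole string becomes the town and nothing is left over.
    rest, _, town = text.rpartition(",")
    rest, town = rest.strip(), town.strip()

    # Step 2: peel off leading lot information, if any (assumed "Lot <digits>").
    m = re.match(r"(Lot \d+)\s*(.*)", rest)
    if m:
        lot, rest = m.groups()

    # Step 3: peel off a leading house number; what remains is the street.
    m = re.match(r"(\d+)\s+(.*)", rest)
    if m:
        number, rest = m.groups()

    return {"lot": lot, "number": number, "street": rest, "town": town}


for line in ["123 any street, mytown", "Lot 2 96 other road, her town", "yourtown"]:
    print(parse_address(line))
```

Because each step is a separate, small pattern, each one can be checked on its own against the sample inputs.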

Hans W
+1  A: 

This should separate into 3 parts - how do you distinguish the address/street?

(Lot \d*)? ?([^,]*,)? ?(.*)

Here is the breakdown for your examples:

('', '123 any street,', 'mytown')
('Lot 4', 'another road,', 'thattown')
('Lot 2', '96 other road,', 'her town')
('', 'this ave,', 'this town')
('', '', 'yourtown')

If I understand correctly, this one separates the address/street as well

(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)

('', '123', 'any street,', 'mytown')
('Lot 4', '', 'another road,', 'thattown')
('Lot 2', '96', 'other road,', 'her town')
('', '', 'this ave,', 'this town')
('', '', '', 'yourtown')
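
As a hedged illustration of how the four-group pattern might be applied, here is a small Python sketch. The helper name `split_address` is hypothetical, and the sketch makes two choices of its own: groups that did not take part in the match are turned into empty strings, and the trailing comma is stripped from the street.

```python
import re

# The four-group pattern quoted above, unchanged.
PATTERN = re.compile(r"(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)")


def split_address(text):
    """Return (lot, number, street, town) as strings, '' when absent."""
    lot, number, street, town = PATTERN.match(text).groups()
    # Optional groups that did not participate in the match come back as None.
    lot = lot or ""
    # The street group still carries its trailing comma; drop it here.
    street = street.rstrip(",").strip() if street else ""
    return lot, number, street, town


for line in ["123 any street, mytown", "Lot 4 another road, thattown", "yourtown"]:
    print(split_address(line))
```

Every group in the pattern is optional, so `PATTERN.match` always succeeds and the helper never has to handle a failed match.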
gnibbler
House numbers aren't that simple; they can have letters after them (or even IIRC before them) or 1/2 and the like after them.
ysth
@ysth, we add test cases to cover those, then. Extending the regex is not so difficult; guessing the requirements is.
gnibbler
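
To make the exchange above concrete, here is one way the pattern could be extended and pinned down with test cases. It is only a sketch under guessed requirements: the number forms accepted here (digits, an optional single-letter suffix, an optional fraction such as "1/2") are assumptions, letters before the number are not handled, and the names `NUMBER`, `EXTENDED`, `CASES` and `groups` are made up for the example.

```python
import re

# Assumed house-number forms: digits, optional letter suffix, optional fraction.
# The \b keeps ordinals such as "5th" from being split into "5t" + "h".
NUMBER = r"\d+[A-Za-z]?\b(?: \d+/\d+)?"
EXTENDED = re.compile(r"(Lot \d+)? ?(" + NUMBER + r")? ?([^,]*,)? ?(.*)")

# Each test case maps an input line to the expected (lot, number, street, town).
CASES = {
    "96 other road, her town": ("", "96", "other road,", "her town"),
    "12B other road, her town": ("", "12B", "other road,", "her town"),
    "96 1/2 other road, her town": ("", "96 1/2", "other road,", "her town"),
    "Lot 2 96A other road, her town": ("Lot 2", "96A", "other road,", "her town"),
    "5th avenue, this town": ("", "", "5th avenue,", "this town"),
    "yourtown": ("", "", "", "yourtown"),
}


def groups(text):
    """Match text and report unmatched groups as '' like the tuples above."""
    return tuple(g or "" for g in EXTENDED.match(text).groups())


for text, expected in CASES.items():
    assert groups(text) == expected, (text, groups(text))
```

Each new requirement then becomes one more entry in `CASES`, which shows immediately whether a change to the pattern breaks an earlier case.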
A: 

I can't match the last one, but for the first three you can use something like this:

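// Capture groups: 1 = lot number, 2 = house number, 3 = street, 4 = town.
if (preg_match('/(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)/m', $subject, $regs)) {
    list(, $lot, $number, $street, $town) = $regs;
} else {
    $lot = $number = $street = $town = "";
}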

This is the regex on its own, for testing:

(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)

You can use this in RegexBuddy to test: link

RJD22
+7  A: 

Take a look at Geo::StreetAddress::US if these are U.S. addresses.

Even if they are not, the source of this module should give you an idea of what is involved in parsing free form street addresses.

Here is a script that handles the addresses you posted (updated; an earlier version combined lot and number into one string):

#!/usr/bin/perl

use strict; use warnings;

local $/ = "";

my @addresses;

while ( my $address = <DATA> ) {
    chomp $address;
    $address =~ s/\s+/ /g;
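    my (%address, $rest);
    # Split on the LAST comma: reverse the string, split once,
    # then reverse the pieces back. The town is whatever follows it.
    ($address{town}, $rest) = map { scalar reverse }
                        split( / ?, ?/, reverse($address), 2 );

    {
        # $rest is undef when the input has no comma (town only).
        no warnings 'uninitialized';
        # Optional "Lot N", optional house number, then the street name.
        @address{qw(lot number street)} =
            $rest =~ /^(?:(Lot [0-9]+) )?(?:([0-9]+) )?(.+)\z/;
    }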
    push @addresses, \%address;
}

use Data::Dumper;
print Dumper \@addresses;

__DATA__
123 any street,
mytown

Lot 4 another road,
thattown

Lot 2 96 other road,
her town

yourtown

street,
town

Output:

$VAR1 = [
          {
            'lot' => undef,
            'number' => '123',
            'street' => 'any street',
            'town' => 'mytown'
          },
          {
            'lot' => 'Lot 4',
            'number' => undef,
            'street' => 'another road',
            'town' => 'thattown'
          },
          {
            'lot' => 'Lot 2',
            'number' => '96',
            'street' => 'other road',
            'town' => 'her town'
          },
          {
            'lot' => undef,
            'number' => undef,
            'street' => undef,
            'town' => 'yourtown'
          },
          {
            'lot' => undef,
            'number' => undef,
            'street' => 'street',
            'town' => 'town'
          }
        ];
Sinan Ünür