ansaurus

Question

Extract Address from String in PHP with RegEx

Answer 1

A:

Your question isn't very clear to me, but if I understood you correctly, you could use a DOM parser to match the `p` tags and then check whether any of them contains the word "Washington" or a phone number that matches the Washington area.
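A minimal sketch of that idea, assuming the address sits inside `p` elements and that PHP's bundled DOM extension is available. The sample markup and the `$candidates` variable are made up for illustration:

```php
<?php
// Hypothetical sample markup; real pages may be structured differently.
$html = '<div><p>1433 Longworth House Office Building Washington, D.C. 20515</p>'
      . '<p>Tucson Office</p></div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$candidates = array();
foreach ($doc->getElementsByTagName('p') as $p) {
    $text = trim($p->textContent);
    // Keep only paragraphs that mention Washington (case-insensitive).
    if (stripos($text, 'washington') !== false) {
        $candidates[] = $text;
    }
}

print_r($candidates);
```

This only narrows the page down to candidate paragraphs; it does not by itself split an address into parts.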

Alix Axel 2009-12-26 02:03:00

The sources won't always have `p` tags. This must be regex-based from what I can tell.

Jonathan Sampson 2009-12-26 02:05:30

Answer 2

+2 A:

EDIT: It appears as though the [anything] data in between the first set of numbers and 'washington' has to be a little more restrictive to work properly. The [anything] section should not contain any numbers, as, well, numbers are what we use to delimit the start of one of the addresses. This works for the three websites you gave us.

I'd say the best first step would be to strip out all HTML tags and replace the `&nbsp;` character entity:

$input = strip_tags($input);
$input = preg_replace("/&nbsp;/"," ",$input);

then if the addresses match (close to) the format you specified, do:

$results = array();
preg_match_all("/[0-9]+\s+[^0-9]*?\s+washington,?\s*D\.?C\.?[^0-9]+[0-9]{5}/si",$input,$results);
foreach($results[0] as $addr){
    echo "$addr<br/>";
}

This works for the three examples you provided, and $results[0] should contain each of the addresses found.

However, this won't work, for instance, if the address has an 'Apartment #2' or the like in it, because it assumes that the numbers closest to 'Washington, DC' mark the start of the address.
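To illustrate that limitation, here is a small sketch that applies the `preg_match_all` pattern used in the test script to an address containing an apartment number. The street address is invented for the example:

```php
<?php
// Hypothetical address with a number between the street number and the city.
$input = '123 Main Street Apartment #2 Washington, DC 20515';

$pattern = '/[0-9]+\s+[^0-9]*?washington,?\s*D\.?C\.?[^0-9]*?[0-9]{5}/si';
preg_match_all($pattern, $input, $results);

// The match starts at the number closest to "Washington", so the
// street number and street name are lost.
echo $results[0][0], "\n"; // prints "2 Washington, DC 20515"
```

The `[^0-9]*?` part cannot skip over the `2`, so the engine gives up on `123` as a starting point and begins the match at the apartment number instead.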

The following script matches each of the test cases:

<?php
    $input = "
        1433&nbsp;Longworth House Office Building Washington,  D.C. 20515
         332 Cannon HOB                      Washington   DC   20515
        1641 LONGWORTH HOUSE OFFICE BUILDING WASHINGTON,  DC   20515
        1238 Cannon H.O.B.
        Washington, DC 20515
        8293 Longworth House Office Building • Washington DC • 20515
        8293 Longworth House Office Building | Washington DC | 20515
    ";
    $input = strip_tags($input);
    $input = preg_replace("/&nbsp;/"," ",$input);

    $results= array();
    preg_match_all("/[0-9]+\s+[^0-9]*?washington,?\s*D\.?C\.?[^0-9]*?[0-9]{5}/si",$input,$results);
    foreach($results[0] as $addr){
        echo "$addr<br/>";
    }

cmptrgeekken 2009-12-26 02:49:54

It's superfluous to surround the whole regex with parentheses. It gets captured in `$matches[0]` anyway.

Geert 2009-12-26 06:04:43

I've updated the original question, please take a look at the changes.

Jonathan Sampson 2009-12-26 07:46:37

Answer 3

A:

This regex takes a more flexible approach to what the input string can contain. The "Washington, DC" part has not been hard-coded into it. The different parts of the address are captured separately; the whole address is captured in $matches[0].

$input = strip_tags($input);
preg_match('/
(\d++)    # Number (one or more digits) -> $matches[1]
\s++      # Whitespace
([^,]++), # Building + City (everything up until a comma) -> $matches[2]
\s++      # Whitespace
(\S++)    # "DC" part (anything but whitespace) -> $matches[3]
\s++      # Whitespace
(\d++)    # Number (one or more digits) -> $matches[4]
/x', $input, $matches);
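As a rough usage sketch, the same pattern (written on one line, without the comments) can be tried against two of the address formats from this thread. The variable names other than `$matches` are my own:

```php
<?php
// Same pattern as above, on one line and without the /x comments.
$pattern = '/(\d++)\s++([^,]++),\s++(\S++)\s++(\d++)/';

// Sample lines based on the formats discussed in this thread.
$withComma    = '1433 Longworth House Office Building Washington, D.C. 20515';
$withoutComma = '332 Cannon HOB Washington DC 20515';

$found = preg_match($pattern, $withComma, $matches);
// $matches[1] is the leading number, $matches[2] the building and city,
// $matches[3] the "D.C." part, and $matches[4] the ZIP code.
echo $matches[1], ' | ', $matches[2], ' | ', $matches[3], ' | ', $matches[4], "\n";

// The pattern requires a literal comma, so this input does not match.
$foundWithoutComma = preg_match($pattern, $withoutComma, $unused);
var_dump($foundWithoutComma); // int(0)
```

Note that the pattern depends on a comma after the city name, so inputs that omit it are not matched.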

Geert 2009-12-26 06:03:10

This is close, but it assumes there will always be a comma. Please re-evaluate the various formats listed in the original question.

Jonathan Sampson 2009-12-26 09:14:24

Answer 4

+1 A:

EDIT:

After looking at the sites you mentioned, I think the following should work. Assuming that you have the contents of the page you crawled in a variable called $page, then you could use

$subject = strip_tags($page);

to remove all HTML markup from the page; then apply the regex

(\d+)\s*(.*?)\s*washington.{0,5}(DC|D\.C\.).{0,5}(\d{5})

RegexBuddy generates the following code for this (I don't know PHP):

if (preg_match('/(\d+)\s*(.*?)\s*washington.{0,5}(DC|D\.C\.).{0,5}(\d{5})/si', $subject, $regs)) {
    $result = $regs[0];
} else {
    $result = "";
}

$regs[1] would then contain the contents of the first capturing parens (numbers), and so forth.

Note the use of the /si modifiers to make the dot match newlines, and to make the regex case-insensitive.
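If a page contains more than one address, a sketch along these lines, using `preg_match_all` with `PREG_SET_ORDER`, might collect the captured parts of each one. The sample text and variable names are assumptions, and the dots in `D\.C\.` are escaped so that they match literal periods:

```php
<?php
// Sample text in two of the formats from the question (already tag-free).
$subject = "332 Cannon HOB                      Washington   DC   20515\n"
         . "1238 Cannon H.O.B.\nWashington, DC 20515";

$pattern = '/(\d+)\s*(.*?)\s*washington.{0,5}(DC|D\.C\.).{0,5}(\d{5})/si';

// PREG_SET_ORDER groups the results per match: $all[0] is the first address,
// $all[1] the second, each holding the capture groups in order.
preg_match_all($pattern, $subject, $all, PREG_SET_ORDER);

foreach ($all as $regs) {
    // $regs[1] = street number, $regs[2] = building, $regs[4] = ZIP code
    echo $regs[1], ' / ', $regs[2], ' / ', $regs[4], "\n";
}
```

Because of the `s` modifier, the second address is matched even though it spans two lines.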

Tim Pietzcker 2009-12-26 08:33:13

Close, but these "anything" blocks should probably be limited to 5 chars, max. Right now, this regex pulls in whole paragraphs under the [anything] blocks. My fault though, since I was too vague.

Jonathan Sampson 2009-12-26 08:41:59

No problem, just replace the `.*?` by `.{0,5}` - I edited my answer accordingly.

Tim Pietzcker 2009-12-26 08:53:56

The following doesn't seem to be matching addresses any longer: `/(\d+).{1,5}washington.{1,5}(DC|D.C.).{1,5}(\d{5})/si`

Jonathan Sampson 2009-12-26 09:09:47

Ah yes, the first "anything" in your examples contains a lot more than 5 characters: ` LONGWORTH HOUSE OFFICE BUILDING `, for example. So I changed that back to `.*?`. If you need to capture the text here, then enclose it in parentheses, like `(.*?)`.

Tim Pietzcker 2009-12-26 09:13:21

Oops, good point. Unfortunately, this is still not matching the address found on http://giffords.house.gov. I currently have: `/(\d+).{1,35}\swashington.{1,5}(DC|D.C.).{1,5}(\d{5})/si`

Jonathan Sampson 2009-12-26 09:18:06
