Hey guys, I am trying to match the address on this page:

http://www.bbb.org/norfolk/business-reviews/tax-return-preparation/liberty-tax-service-in-virginia-beach-va-48000604

The address part of the page source has this HTML:

<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>

So, I tried the following RegEx in PHP.

"%Address:</td>(.*?)(?!<br />)</td>%s"

where "s" is the modifier that lets "." match newlines too. But it is not working: it doesn't match the "Dayton, OH 45402" part. Can anyone tell me why?

+1  A: 

Please don't try to parse HTML with regular expressions, it invokes the wrath of Zalgo.

Try using the DOM and xpath to target the specific elements and attributes you are attempting to extract.

(I'd provide an xpath example, but it's still on my to-learn list... :) )
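For anyone landing here later, a minimal sketch of what the DOM and XPath route could look like with PHP's built-in DOMDocument and DOMXPath. It assumes the markup looks like the snippet in the question (wrapped in a <table> here so the fragment parses cleanly) and that the city line always sits in the row right after the "Address:" row. The variable names are just for illustration.

```php
<?php
// Sample markup copied from the question; in practice this would be the fetched page.
$html = <<<'HTML'
<table>
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Find the row whose left-hand cell reads "Address:".
$row = $xpath->query(
    '//tr[td[@class="generalinfo_left"][normalize-space()="Address:"]]'
)->item(0);

$address = [];
if ($row !== null) {
    // Street line: the right-hand cell of that row.
    $address[] = trim($xpath->evaluate('string(td[@class="generalinfo_right"])', $row));

    // City line: the right-hand cell of the row immediately after it (assumption).
    $next = $xpath->query('following-sibling::tr[1]/td[@class="generalinfo_right"]', $row)->item(0);
    if ($next !== null) {
        $address[] = trim($next->textContent);
    }
}

echo implode(', ', $address), "\n";   // 1 S Main St Ste 1430, Dayton, OH 45402
```

The <br /> never shows up in the result because only the text content of the cells is read.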

Charles
A: 

The .*? goes all the way to the end of the <br />. At that point the next text is "</td>", not "<br />", so the negative lookahead is satisfied and the match succeeds, with the capture being "<td class="generalinfo_right">1 S Main St Ste 1430<br />". In other words, the lookahead doesn't prevent the match because it is checked too late, after the <br /> has already been consumed.
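This is easy to check by running the pattern from the question against the sample markup and printing the capture. A small sketch, with $html and $m as illustrative names:

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
HTML;

// The pattern exactly as posted in the question.
$pattern = '%Address:</td>(.*?)(?!<br />)</td>%s';

$matched = preg_match($pattern, $html, $m);

// The lazy group stops at the first </td> it can reach, which is the one
// right after <br />, so the second row is never looked at.
echo trim($m[1]), "\n";
// <td class="generalinfo_right">1 S Main St Ste 1430<br />
```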

There are ways to write it correctly (e.g. you could explicitly add the <tr> and then the <td class="generalinfo_right">). However, Charles is right that you should use a real parser.
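One possible way to spell that out, as a sketch only: walk explicitly over the row boundary and the empty left-hand cell, and make the <br /> optional. It assumes the whitespace and attribute layout are as in the question's snippet, so it will break as soon as the page's markup shifts.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
HTML;

$pattern = '%Address:</td>\s*'
         . '<td class="generalinfo_right">(.*?)(?:<br />)?</td>\s*'   // street line, optional <br />
         . '</tr>\s*<tr>\s*'                                           // step into the next row
         . '<td[^>]*></td>\s*'                                         // empty left-hand cell
         . '<td class="generalinfo_right">(.*?)</td>%s';               // city line

$street = $city = null;
if (preg_match($pattern, $html, $m)) {
    $street = $m[1];
    $city   = $m[2];
    echo $street, ', ', $city, "\n";   // 1 S Main St Ste 1430, Dayton, OH 45402
}
```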

Matthew Flaschen
A: 

That's expected. If you look at your sample text, you will see that between "Address" and "Dayton, OH 45402" you have <br />. (?!<br />) only states that <br /> must not come immediately next at that point; it does not skip over a <br /> that has already been matched.

You should use a parser for HTML.

That said, assuming that all your files are exactly like this sample, this ugly regex should work:

%(Address:)(.*?generalinfo_right">)(.*?)((<br />)|(</td>))(.*?generalinfo_right">)(.*?)((<br />)|(</td>))%s

Groups 1, 3 and 8 contain the address.
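To see the group numbering in action, here is a short sketch that runs the pattern above through preg_match against the question's sample markup ($html and $m are illustrative names):

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
HTML;

// The pattern from this answer, unchanged.
$pattern = '%(Address:)(.*?generalinfo_right">)(.*?)((<br />)|(</td>))(.*?generalinfo_right">)(.*?)((<br />)|(</td>))%s';

$matched = preg_match($pattern, $html, $m);

echo $m[1], "\n";   // Address:
echo $m[3], "\n";   // 1 S Main St Ste 1430
echo $m[8], "\n";   // Dayton, OH 45402
```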

However, since most likely your documents are not all exactly like that, a much better solution will be to parse HTML with a proper parser.

Sylverdrag
Thanks, it works! And definitely, I will try out the parsers to parse HTML.
Shubham
It's very tempting to downvote for this horrific abuse of parentheses. Your expression only needs four groups - two non-capturing - not the *eleven* captures you are using!
Peter Boughton
@Peter: LOL. Relax, those parentheses were on sale, they don't cost much. As I said, this is an ugly regex, but it works and since my suggestion is to use a parser, I didn't take any time to make it look any better than it does now, I just typed up the first thing that came to mind.
Sylverdrag
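Peter's trimmed-down expression is not spelled out in the thread. One guess at what a four-group version could look like (two capturing groups for the address lines, two non-capturing groups for the <br />-or-</td> alternatives), again only a sketch that assumes markup exactly like the sample:

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
HTML;

// (?: ... ) groups without capturing, so only the two address lines get numbers.
$pattern = '%Address:.*?generalinfo_right">(.*?)(?:<br />|</td>)'
         . '.*?generalinfo_right">(.*?)(?:<br />|</td>)%s';

$matched = preg_match($pattern, $html, $m);

echo $m[1], "\n";   // 1 S Main St Ste 1430
echo $m[2], "\n";   // Dayton, OH 45402
```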