I'm having trouble capturing this data:

              <tr>
                <td><span class="bodytext"><b>Contact:</b><b></b></span><span style='font-size:10.0pt;font-family:Verdana;
  mso-bidi-font-family:Arial'><b> </b> 
                      <span class="bodytext">John Doe</span> 
                     </span></td>
              </tr>
              <tr>
                <td><span class="bodytext">PO Box 2112</span></td>
              </tr>
              <tr>
                <td><span class="bodytext"></span></td>
              </tr>

        <!--*********************************************************


        -->
        <tr>
                <td><span class="bodytext"></span></td>
              </tr>



              <tr>
                <td><span class="bodytext">JOHAN</span> NSW 9700</td>
              </tr>
              <tr>
                <td><strong>Phone:</strong> 
                02 9999 9999
                    </td>
              </tr>

Basically, I want to grab everything after "Contact:" and before "Phone:", minus the HTML. However, these two labels may not always exist, so what I really need is everything between the two colons (:) that isn't inside an HTML tag. The number of <span class="bodytext">***data***</span> elements may vary, so I need some sort of loop for matching them.

I'd prefer to use regular expressions, although I could probably do this using loops and string matches.

Also, I'd like to know the syntax for non-matching groups in PHP regex.

Any help would be greatly appreciated!

+1  A: 

If I understand you correctly, you're only interested in the text between the HTML tags. To ignore the HTML tags, simply strip them first:

$text = preg_replace('/<[^<>]+>/', '', $html);

To grab everything between "Contact:" and "Phone:", use:

if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}
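
Put together, the two steps above might look like the following sketch. The $html value here is a hypothetical, abbreviated version of the markup in the question, and the whitespace clean-up at the end is an addition that the answer itself does not include.

```php
<?php
// Hypothetical sample input, abbreviated from the markup in the question.
$html = <<<'HTML'
<tr><td><span class="bodytext"><b>Contact:</b></span>
  <span class="bodytext">John Doe</span></td></tr>
<tr><td><span class="bodytext">PO Box 2112</span></td></tr>
<!-- ******
-->
<tr><td><span class="bodytext">JOHAN</span> NSW 9700</td></tr>
<tr><td><strong>Phone:</strong> 02 9999 9999</td></tr>
HTML;

// Strip the tags. The comment is removed too, because it also
// starts with "<" and ends with ">".
$text = preg_replace('/<[^<>]+>/', '', $html);

$result = '';
if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
    // Assumed extra step: collapse the whitespace left behind by the tags.
    $result = trim(preg_replace('/\s+/', ' ', $regs[1]));
}

echo $result, "\n";
```

The s modifier matters here because the captured text spans several lines once the tags are gone.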

To grab everything between two colons, use:

if (preg_match('/:([^:]*):/', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}
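
One thing to watch with the two-colon pattern: the capture runs right up to the second colon, so it still ends with the label word in front of it ("Phone" in the sample). The sketch below shows that on a hypothetical, already tag-free string, followed by one possible variant that keeps the label outside the capture. The variant is an assumption rather than part of the answer, and it would also drop a genuine data word if that happened to sit directly before the second colon.

```php
<?php
// Hypothetical tag-free text, as produced by the stripping step.
$text = "Contact: John Doe PO Box 2112 JOHAN NSW 9700 Phone: 02 9999 9999";

// Original pattern: the capture extends to the second colon,
// so it still ends with the word "Phone".
preg_match('/:([^:]*):/', $text, $regs);
$withLabel = trim($regs[1]);

// Variant (an assumption, not from the answer): match an optional
// trailing label word in a non-capturing group, outside the capture.
preg_match('/:\s*([^:]*?)\s*(?:\w+)?:/', $text, $regs);
$withoutLabel = $regs[1];

echo $withLabel, "\n", $withoutLabel, "\n";
```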
Jan Goyvaerts
A: 

The seemingly arbitrary Stack Overflow response to these sorts of questions is "omg don't use regexes! Use Beautiful Soup instead!!". Personally, I prefer not to use external libraries for small tasks like this, and regexes are a good alternative.

A simple way to strip out all the HTML tags, which is one way to tackle this, is to use this regex:

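// The s modifier lets . match newlines, so tags and comments that span several lines are removed too.
$text = preg_replace("/<.*?>/s", "", $text);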

then you can use whatever method you like to grab the appropriate text content.

Non matching groups are like this: (?:this won't match)
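
As a rough illustration of how the (?:...) syntax behaves with PHP's preg functions, here is a small sketch. The sample string and the Phone|Fax alternation are made up for the example.

```php
<?php
// Hypothetical sample string, not taken from the question.
$text = 'Phone: 02 9999 9999';

// (?:Phone|Fax) groups the alternation without creating a capture,
// so the number ends up in group 1 rather than group 2.
preg_match('/(?:Phone|Fax):\s*([\d ]+)/', $text, $m);

echo count($m), "\n"; // $m[0] is the whole match, $m[1] is the number
echo $m[1], "\n";
```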

nickf
(?this won't match) is a syntax error
Jan Goyvaerts
So what is it? RegexBuddy gave me (?:this won't match) as a Perl regex, but there was no PHP option, so I couldn't be sure...
E3
PHP's preg functions use the PCRE flavor, which is an option in RegexBuddy. nickf's answer missed the : before he edited it.
Jan Goyvaerts
I believe you (and the OP) mean "non-capturing groups" instead of "non-matching groups". A non-*matching* group would be something like this: "(X(?<!X))". ;-)
Tomalak
A: 

Sounds like screen scraping. You could also use strip_tags() once you've found the info you want.
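
A minimal sketch of that order of operations, assuming a hypothetical two-cell fragment modelled on the question's markup: match on the raw HTML first, then pass only the captured piece through strip_tags().

```php
<?php
// Hypothetical fragment modelled on the question's markup.
$html = '<td><b>Contact:</b> <span class="bodytext">John Doe</span></td>'
      . '<td><strong>Phone:</strong> 02 9999 9999</td>';

$result = '';
if (preg_match('/Contact:(.*?)Phone:/s', $html, $regs)) {
    // strip_tags() is a built-in PHP function that removes HTML tags.
    $result = trim(strip_tags($regs[1]));
}

echo $result, "\n";
```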

Phill Pafford