ansaurus

Question

Need help with extracting data with PHP Regular Expressions

Answer 1

+1 A:

If I understand you correctly, you're only interested in the text between the HTML tags. To ignore the HTML tags, simply strip them first:

$text = preg_replace('/<[^<>]+>/', '', $html);

To grab everything between "Contact:" and "Phone:", use:

if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}

To grab everything between two colons, use:

if (preg_match('/:([^:]*):/', $text, $regs)) {
  $result = $regs[1];
} else {
  $result = "";
}
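Putting the two steps together, here is a minimal sketch of the strip-then-extract approach. The sample markup in $html and the $contact variable are assumptions made up for illustration, since the asker's actual HTML isn't shown.

```php
<?php
// Hypothetical sample markup; the asker's real HTML is not shown in this thread.
$html = "<p><b>Contact:</b> Jane Doe<br>\n<b>Phone:</b> 555-0100</p>";

// Step 1: strip the tags so only the text remains.
$text = preg_replace('/<[^<>]+>/', '', $html);

// Step 2: lazily capture everything between the two labels.
// The /s modifier lets . match the newline left over from the markup.
$contact = '';
if (preg_match('/Contact:(.*?)Phone:/s', $text, $regs)) {
    // trim() drops the surrounding space and newline from the capture.
    $contact = trim($regs[1]);
}

echo $contact, "\n";
```

With this input the captured text includes a leading space and a trailing newline, which is why the sketch trims it before use.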

Jan Goyvaerts 2008-12-18 02:38:55

Answer 2

A:

The seemingly arbitrary Stack Overflow response to this sort of question seems to be "omg don't use regexes! Use Beautiful Soup instead!!". Personally, I prefer not to pull in external libraries for small tasks like this, and regexes are a good alternative.

A simple way to strip out all the HTML tags, which is one way to tackle this, is to use this regex:

$text = preg_replace("/<.*?>/", "", $text);

Then you can use whatever method you like to grab the appropriate text content.
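One caveat worth checking with this pattern: by default . does not match a newline, so a tag that wraps across lines is left in place. A small sketch, using a made-up input string, comparing it with the /s modifier and with a negated character class:

```php
<?php
// Hypothetical input: an opening tag whose attribute wraps onto a second line.
$html = "<a\nhref='x'>link</a>";

// Without /s, . stops at the newline, so the opening tag survives.
$dotOnly = preg_replace('/<.*?>/', '', $html);

// Either the /s modifier or a negated character class copes with the newline.
$dotAll  = preg_replace('/<.*?>/s', '', $html);
$negated = preg_replace('/<[^<>]+>/', '', $html);

var_dump($dotOnly === "<a\nhref='x'>link");
var_dump($dotAll === 'link');
var_dump($negated === 'link');
```

All three comparisons should come out true, which suggests adding /s (or using [^<>]+) if the markup may contain line breaks inside tags.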

Non matching groups are like this: (?:this won't match)
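To see what the (?:...) syntax changes in practice, here is a short sketch with PHP's preg_match. The input string and variable names are made up for illustration.

```php
<?php
// Hypothetical input string for illustration.
$text = 'Phone: 555-0100';

// Plain group: the label lands in index 1 and the number in index 2.
preg_match('/(Contact|Phone): (\S+)/', $text, $capturing);

// (?:...) still groups the alternation, but stores nothing,
// so the number becomes index 1.
preg_match('/(?:Contact|Phone): (\S+)/', $text, $nonCapturing);

print_r($capturing);
print_r($nonCapturing);
```

The group still takes part in the match either way; the only difference is that it no longer takes up a slot in the result array.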

nickf 2008-12-18 02:39:27

(?this won't match) is a syntax error

Jan Goyvaerts 2008-12-18 02:48:43

So what is it? RegexBuddy gave me (?:this won't match) as a Perl regex, but there was no PHP option, so I couldn't be sure...

E3 2008-12-18 02:58:11

PHP's preg functions use the PCRE flavor, which is an option in RegexBuddy. nickf's answer missed the : before he edited it.

Jan Goyvaerts 2008-12-18 08:09:46

I believe you (and the OP) mean "non-capturing groups" instead of "non-matching groups". A non-*matching* group would be something like this: "(X(?<!X))". ;-)

Tomalak 2008-12-18 08:17:23

Answer 3

A:

Sounds like screen scraping. You could also use strip_tags() after finding the info you wanted.
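A rough sketch of that order of operations, assuming some made-up sample markup: match on the raw HTML first, then run strip_tags() over the captured fragment.

```php
<?php
// Hypothetical sample markup; the asker's real HTML is not shown in this thread.
$html = "<p><b>Contact:</b> Jane Doe<br>\n<b>Phone:</b> 555-0100</p>";

$contact = '';
// Find the fragment between the labels while the tags are still present.
if (preg_match('/Contact:(.*?)Phone:/s', $html, $regs)) {
    // The fragment still contains </b>, <br> and <b>; strip_tags() removes them.
    $contact = trim(strip_tags($regs[1]));
}

echo $contact, "\n";
```

strip_tags() is built into PHP, so this avoids writing a separate tag-stripping regex.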

Phill Pafford 2009-10-05 13:33:27
