ansaurus

Question

PHP Regex Difficulty

Answer 1

+1 A:

Take a look into the PCRE modifiers: http://ar2.php.net/manual/en/reference.pcre.pattern.modifiers.php

You can apply the s modifier, like '/id="content">(.*?)<SCRIPT/s'. (The s modifier only changes what the dot matches; it is the m modifier that changes the way ^ and $ work.)

Otherwise, you can do '/id="content">((.|\n)*?)<SCRIPT/'

EDIT: oops, wrong modifier...

Tordek 2009-05-24 07:13:24

You want /s not /m. /s changes the behavior of dot. /m changes ^ and $. "s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded."

Schwern 2009-05-24 07:21:01

Schwern, could you explain that? So is it /id="content">(.*?)<SCRIPT/s? It doesn't quite seem to be working for me.

chris 2009-05-24 07:41:27

Answer 2

A:

Try

id="content">((?:.|\n)*?)<SCRIPT

The usual warning not to parse HTML with regex applies, but you seem to know that already.

Alternatively:

(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)

The dot does not match newline characters by default. One way to get around that is to explicitly allow them. This would work even if the regex flavor you happen to use did not support a "dotall" modifier.

The first regex is equivalent to your approach, extended to allow \n. Your match would be in group 1; you only need to trim it.

The second regex uses zero-width assertions (look-ahead/look-behind) to mark the beginning and the end of the match. The match would not contain anything you don't want, so no grouping is necessary.
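
To make the difference between the two variants concrete, here is a minimal sketch. The sample input in $html is an assumption modelled on the markup in the question, and the variable names are only illustrative.

```php
<?php
// Hypothetical sample input, modelled on the markup in the question.
$html = "<div id=\"content\">\n\n<div id=\"contentleft\">  <SCRIPT language=JavaScript>";

// Variant 1: the wanted text is captured in group 1.
preg_match('/id="content">((?:.|\n)*?)<SCRIPT/', $html, $m1);
$fromGroup = trim($m1[1]);

// Variant 2: look-behind and look-ahead keep the markers out of the match,
// so the whole match ($m2[0]) is already the wanted text.
preg_match('/(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)/', $html, $m2);
$fromLookaround = trim($m2[0]);

// Both variants should yield the same trimmed text.
var_dump($fromGroup === $fromLookaround);
```

On large inputs the (?:.|\n) alternation is likely to be slower than the s modifier, so it is probably best kept for flavors that lack a dotall option.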

Tomalak 2009-05-24 08:10:18

Answer 3

A:

Another solution without regular expressions:

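$start = 'id="content">';
$end = '<SCRIPT';
if (($startPos = strpos($str, $start)) !== false &&
    ($endPos = strpos($str, $end, $startPos + strlen($start))) !== false) {
    // Skip past the start marker so only the text between the markers is kept.
    $startPos += strlen($start);
    $substr = substr($str, $startPos, $endPos - $startPos);
}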
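
The same idea could be wrapped in a small reusable function. This is only a sketch: extract_between is a hypothetical name, not something from the answer, and it swaps strpos for stripos on the assumption that the tag case may vary.

```php
<?php
// Hypothetical helper: returns the text between two markers, or null when
// either marker is missing. stripos makes the search case-insensitive.
function extract_between($haystack, $start, $end)
{
    $startPos = stripos($haystack, $start);
    if ($startPos === false) {
        return null;
    }
    $startPos += strlen($start); // move past the start marker
    $endPos = stripos($haystack, $end, $startPos);
    if ($endPos === false) {
        return null;
    }
    return substr($haystack, $startPos, $endPos - $startPos);
}

// Assumed sample input with a lowercase <script> tag.
$str = "<div id=\"content\">\nHello\n<script language=JavaScript>";
$result = extract_between($str, 'id="content">', '<SCRIPT');
```

Returning null instead of leaving a variable unset makes the "marker not found" case explicit for the caller.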

Gumbo 2009-05-24 08:33:30

great idea, didn't think of that

chris 2009-05-25 08:38:15

Answer 4

A:

Well, it is a multi line issue so take a look at pattern modifiers:

m (PCRE_MULTILINE) By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl.

When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

from http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
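
A short sketch of what the quoted text means in practice. The subject string and variable names are made up for illustration.

```php
<?php
$subject = "first line\nsecond line";

// No modifier: the dot stops at the newline, so this does not match.
$plain = preg_match('/first.*second/', $subject);

// s (PCRE_DOTALL): the dot also matches the newline.
$dotall = preg_match('/first.*second/s', $subject);

// m (PCRE_MULTILINE): ^ can match right after a newline; the dot is unchanged.
$multiline = preg_match('/^second/m', $subject);
$noMultiline = preg_match('/^second/', $subject);

// A negated class matches newlines regardless of the s modifier.
$negated = preg_match('/first[^a]*second/', $subject);
```

So for a pattern built around .*? that has to span several lines, s is the modifier that matters; m only helps when the pattern uses ^ or $.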

coma 2009-05-24 08:50:49

I would also use the case-insensitive "i" modifier, because the tag can be written as <div>, <DIV>, etc.

Jet 2009-05-24 12:47:43

So, would I add /is at the end if I want to do both?

chris 2009-05-25 08:39:00

yes, you can add the pattern modifiers that way.
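
For illustration, a minimal sketch of combining the two modifiers. The input string is an assumption, with a lowercase script tag to show what i adds.

```php
<?php
// Hypothetical input with a lowercase <script> tag.
$s = "<div id=\"content\">\n<p>Hello</p>\n<script language=JavaScript>";

// s alone lets the dot cross newlines, but <SCRIPT does not match <script>.
$withS = preg_match('/id="content">(.*?)<SCRIPT/s', $s, $m1);

// Adding i makes the whole pattern case-insensitive as well.
$withIS = preg_match('/id="content">(.*?)<SCRIPT/is', $s, $m2);
```

The order of the modifier letters after the closing delimiter does not matter, so /si behaves the same as /is.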

coma 2009-05-25 18:45:54

Answer 5

+2 A:

$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

if( preg_match('/id="content">(.*?)<SCRIPT/s', $s, $matches) )
    print $matches[1]."\n";

Dot, by default, matches everything but newlines. /s makes it match everything.

But really, use a DOM parser. You can walk the tree or you can use an XPath query. Think of it like regexes for XML.

$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

// Load the HTML
$doc = new DOMDocument();
$doc->loadHTML($s);

// Use XPath to find the <div id="content"> tag's descendants.
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@id='content']/descendant::*");

foreach( $entries as $node ) {
    // Stop when we see <script ...>
    if( $node->nodeName == "script" )
        break;

    // do what you want with the content
}

XPath is extremely powerful. Here are some examples.
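
A few illustrative queries, as a sketch only. The sample markup and variable names are assumptions, and libxml_use_internal_errors is added here just to keep parser warnings quiet.

```php
<?php
// Hypothetical sample document.
$html = '<div id="content"><p class="intro">Hi</p><p>Bye</p><a href="/x">link</a></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// All <p> elements anywhere inside the content div.
$paragraphs = $xpath->query("//div[@id='content']//p");

// Only paragraphs carrying class="intro".
$intro = $xpath->query("//p[@class='intro']");

// The href attribute of every link.
$hrefs = $xpath->query("//a/@href");
```

Each query returns a DOMNodeList, so the results can be counted with ->length or walked with foreach.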

PS I'm sure (I hope) the above code can be tightened up some.

Schwern 2009-05-24 09:02:45

Answer 6

A:

$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTML($html_str);

$xpath = new DOMXPath($dom);
// The leading // lets the expression match the div anywhere in the document.
$div = $xpath->query('//div[@id="content"]')->item(0);

Please correct my XPath expression if it is wrong; I have not tested it.
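
A minimal sketch of one way to finish this approach, assuming the expression is written with a leading // so that it matches the div anywhere in the document. The sample markup in $html_str and the $content variable are made up for illustration.

```php
<?php
// Hypothetical sample input.
$html_str = '<div id="content"><p>Keep me</p><script>var x = 1;</script><p>After</p></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html_str);

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="content"]')->item(0);

// Collect the markup of the div's children up to the first <script>.
$content = '';
if ($div !== null) {
    foreach ($div->childNodes as $child) {
        if ($child->nodeName === 'script') {
            break;
        }
        $content .= $dom->saveHTML($child);
    }
}
```

The serialized markup may differ slightly in whitespace from the input, so it is safer to trim $content before comparing or printing it.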

Jet 2009-05-24 12:46:27
