I am having difficulty writing regular expressions when there is whitespace and carriage returns between the pieces of text I want to match.

For example, in the case below, how can I write a regular expression that captures "<div id="contentleft">"?

<div id="content"> 


<div id="contentleft">  <SCRIPT language=JavaScript>

I tried

id="content">(.*?)<SCRIPT

but it doesn't work.

+1  A: 

Take a look into the PCRE modifiers: http://ar2.php.net/manual/en/reference.pcre.pattern.modifiers.php

You can apply the s modifier, like '/id="content">(.*?)<SCRIPT/s'. (Watch out not to confuse it with the m modifier, which changes the way ^ and $ work instead.)

Otherwise, you can do '/id="content">((.|\n)*?)<SCRIPT/'

EDIT: oops, wrong modifier...

Tordek
You want /s not /m. /s changes the behavior of dot. /m changes ^ and $. "s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded."
Schwern
schwern, could you explain that? So is that /id="content">(.*?)<SCRIPT/s? It doesn't quite seem to be working for me.
chris
A: 

Try

id="content">((?:.|\n)*?)<SCRIPT

The usual warning not to parse HTML with regex applies, but you seem to know that already.

Alternatively:

(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)

The dot does not match newline characters by default. One way to get around that is to explicitly allow them. This would work even if the regex flavor you happen to use did not support a "dotall" modifier.

The first regex is equivalent to your approach, extended by allowing \n. Your match would be in group 1; you only need to trim it.

The second regex uses zero-width assertions (look-ahead/look-behind) to mark the beginning and the end of the match. The match would not contain anything you don't want, so no grouping is necessary.
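As a rough PHP sketch of the two patterns above (the sample string and variable names are only for illustration, and preg_match is assumed as the matching function):

```php
<?php
// Sample input modelled on the question (illustrative only).
$html = "<div id=\"content\"> \n\n\n<div id=\"contentleft\">  <SCRIPT language=JavaScript>";

// First pattern: the wanted text ends up in capture group 1.
$captured = null;
if (preg_match('/id="content">((?:.|\n)*?)<SCRIPT/', $html, $m)) {
    $captured = trim($m[1]);
}

// Second pattern: look-behind/look-ahead, so the whole match is the wanted text.
$whole = null;
if (preg_match('/(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)/', $html, $m)) {
    $whole = trim($m[0]);
}

echo $captured, "\n";
echo $whole, "\n";
```

Both variables should end up holding the same trimmed text; the only difference is whether it is read from group 1 or from the whole match.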

Tomalak
A: 

Another solution without regular expressions:

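$start = 'id="content">';
$end = '<SCRIPT';
// Find the start marker, then the end marker somewhere after it.
if (($startPos = strpos($str, $start)) !== false &&
    ($endPos = strpos($str, $end, $startPos + strlen($start))) !== false) {
    // Take only the text between the two markers.
    $contentPos = $startPos + strlen($start);
    $substr = substr($str, $contentPos, $endPos - $contentPos);
}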
Gumbo
great idea, didn't think of that
chris
A: 

Well, it is a multi line issue so take a look at pattern modifiers:

m (PCRE_MULTILINE) By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl.

When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

from http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
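To make the difference between the two modifiers concrete, here is a minimal sketch; the sample string and variable names are made up for illustration:

```php
<?php
$text = "first\nsecond";

// Without modifiers, the dot stops at the newline, so this does not match.
$plainDot = preg_match('/first.second/', $text);

// With s (PCRE_DOTALL), the dot also matches the newline.
$dotallDot = preg_match('/first.second/s', $text);

// m (PCRE_MULTILINE) does not change the dot, so this still fails...
$multilineDot = preg_match('/first.second/m', $text);

// ...but it lets ^ match right after the newline.
$plainAnchor = preg_match('/^second/', $text);
$multilineAnchor = preg_match('/^second/m', $text);

var_dump($plainDot, $dotallDot, $multilineDot, $plainAnchor, $multilineAnchor);
```

Only the s and the anchored m patterns match here, which is why s is the modifier that matters for the question.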

coma
I would also use the case-insensitive "i" modifier, because the tag can be written as <div>, <DIV>, etc.
Jet
so, would i add /is at the end if i want to do both?
chris
yes, you can add the pattern modifiers that way.
coma
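A small sketch of combining both modifiers as discussed in the comments above; the lowercase sample markup and the variable names are assumptions for illustration:

```php
<?php
// Same structure as the question, but with a lowercase <script> tag.
$html = "<div id=\"content\">\n\n<div id=\"contentleft\">  <script language=JavaScript>";

$inner = null;
// i: <SCRIPT in the pattern also matches <script in the input.
// s: the dot is allowed to cross the newlines.
if (preg_match('/id="content">(.*?)<SCRIPT/is', $html, $m)) {
    $inner = trim($m[1]);
}

echo $inner, "\n";
```

The order of the modifier letters after the closing delimiter does not matter, so /si would behave the same way.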
+2  A: 
$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

if( preg_match('/id="content">(.*?)<SCRIPT/s', $s, $matches) )
    print $matches[1]."\n";

Dot, by default, matches everything but newlines. /s makes it match everything.

But really, use a DOM parser. You can walk the tree or you can use an XPath query. Think of it like regexes for XML.

$s = '<div id="content">

<div id="contentleft">  <SCRIPT language=JavaScript>';

// Load the HTML
$doc = new DOMDocument();
$doc->loadHTML($s);

// Use XPath to find the <div id="content"> tag's descendants.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[@id='content']/descendant::*");

foreach( $nodes as $node ) {
    // Stop when we see <script ...>
    if( $node->nodeName == "script" )
        break;

    // do what you want with the content
}

XPath is extremely powerful. Here are some examples.

PS I'm sure (I hope) the above code can be tightened up some.

Schwern
A: 
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTML($html_str);

$xpath = new DOMXPath($dom);
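$div = $xpath->query('//div[@id="content"]')->item(0);

The leading // is needed so that the div is matched anywhere in the document; a bare div[@id="content"] is evaluated relative to the document node and would find nothing.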

Jet