I have the following sample set of data:

<p>first line\n
second line\n
third line\n</p>
first line\n
second line\n
third line\n

Using regex, how could I match the newline characters, but only when they are within the paragraph tags?

This code would be used in PHP.

+2  A: 

You could split this into two regexes. First extract the contents of your <p> tags with <p>.*?</p> (using the s modifier, since . does not match newlines by default), then match on newlines in the result.
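A minimal sketch of that two-step approach in PHP. The function name newlinesInParagraphs and the sample string are made up for illustration, and it assumes simple, non-nested <p> tags and real newline characters rather than the literal text "\n":

```php
<?php
// Hypothetical helper: counts newline characters inside each <p>...</p> block.
// Assumes well-formed, non-nested <p> tags.
function newlinesInParagraphs(string $html): array
{
    $counts = [];
    // Step 1: isolate the paragraphs. The "s" modifier lets "." match newlines,
    // "i" makes the tag name case-insensitive, and \b keeps <pre> from matching.
    preg_match_all('#<p\b[^>]*>(.*?)</p>#is', $html, $paragraphs);
    foreach ($paragraphs[1] as $body) {
        // Step 2: match newlines only within the paragraph body.
        $counts[] = preg_match_all('/\n/', $body);
    }
    return $counts;
}

$html = "<p>first line\nsecond line\nthird line\n</p>\nfirst line\nsecond line\n";
print_r(newlinesInParagraphs($html)); // one entry per <p>, here a single count of 3
```

The newlines after the closing </p> are never seen by the second regex, which is the point of splitting the work in two.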

Divide and conquer. Several small regexes will often perform faster than one huge one.

I assume you have total control over the HTML and know it's well formed, because using regex on HTML is a no-no in most cases. If that's not the case, use a DOM parser instead.

Mikael Svenson
+1  A: 

Well, regexes are not well suited to parsing HTML (use DOMDocument for that). You also said that you want to "match on" the newlines. Does that mean capture? Replace? Check for? Assuming "check for", here's a crude one:

$regex = '#(?i:<p[^>]*>[^\\n]*)(\\n)(?i:[^<]*</p>)#';

It won't match <p><i>foo\n</i></p>, but it will match a newline inside a basic <p> tag (one with no HTML child elements).
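A quick way to check that behaviour, as a sketch. The three input strings are illustrative and use real newline characters; only the pattern itself comes from this answer:

```php
<?php
// The pattern from above, unchanged.
$regex = '#(?i:<p[^>]*>[^\\n]*)(\\n)(?i:[^<]*</p>)#';

// Illustrative inputs (assumed, not from the question).
$plain   = "<p>first line\nsecond line\n</p>";
$nested  = "<p><i>foo\n</i></p>";
$outside = "<p>no newline here</p>\nfirst line\n";

var_dump(preg_match($regex, $plain));    // int(1): newline directly inside <p>
var_dump(preg_match($regex, $nested));   // int(0): the child element blocks the match
var_dump(preg_match($regex, $outside));  // int(0): the newline is outside the <p>
```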

What I'd suggest is grabbing DOMDocument and doing something like this:

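$dom = new DOMDocument();
// $html holds the HTML string to inspect
$dom->loadHTML($html);
$pTags = $dom->getElementsByTagName('p');
foreach ($pTags as $p) {
    // textContent is the element's text, including text of any child elements
    $txt = $p->textContent;
    if (strpos($txt, "\n") !== false) {
        // Found a \n within a <p> tag
    }
}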
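
For a version you can run end to end, here is a rough sketch under a few assumptions: the helper name paragraphsWithNewlines and the sample markup are hypothetical, and libxml warnings are collected internally so that sloppy markup does not emit notices. Unlike the crude regex, it also catches newlines inside child elements:

```php
<?php
// Hypothetical helper: returns the text of every <p> that contains a newline.
function paragraphsWithNewlines(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        // textContent includes text from child elements such as <i>.
        if (strpos($p->textContent, "\n") !== false) {
            $found[] = $p->textContent;
        }
    }
    return $found;
}

// Illustrative markup: one <p> with a nested newline, one without, plus text outside.
$html = "<div><p><i>foo\n</i></p><p>bar</p>\nbaz\n</div>";
var_dump(count(paragraphsWithNewlines($html))); // int(1)
```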
ircmaxell