Hi all, I'm pretty clueless when it comes to PHP and regex but I'm trying to fix a broken plugin for my forum.

I'd like to replace the following:

<blockquote rel="blah">foo</blockquote>

with:

<blockquote class="a"><div class="b">blah</div><div class="c"><p>foo</p></div></blockquote>

Actually, that part is easy and I've already partially fixed the plugin to do this. The following regex is being used in a call to preg_replace_callback() to do the replacement:

/(<blockquote rel="([\d\w_ ]{3,30})">)(.*)(<\/blockquote>)/u

The callback code is:

return <<<BLOCKQUOTE
<blockquote class="a"><div class="b">{$Matches[2]}</div><div class="c"><p>{$Matches[3]}</p></div></blockquote>
BLOCKQUOTE;

And that works for my above example (non-nested blockquotes). However, if the blockquotes are nested, such as in the following example:

<blockquote rel="blah">foo <blockquote rel="bloop">bar ...maybe another nest...</blockquote></blockquote>

It doesn't work. So my question is, how can I replace all nested blockquotes using a combination of regex/PHP? I know recursive patterns are possible in PHP with (?R); the following regex will extract all nested blockquotes from the string containing them:

/(<blockquote rel="([\d\w_ ]{3,30})">)(.*|(?R))(<\/blockquote>)/s

But from there on I'm not quite sure what to do in the preg_replace_callback() callback to replace each nested blockquote with the above replacement.

Any help would be appreciated.

A: 

The simple answer is that you can't do this with a regular expression in the strict sense. The language of nested tags (or parens, or brackets, or anything) of arbitrary depth is not regular and hence cannot be matched by a true regular expression. I would suggest you use a DOM parser or, if absolutely necessary for some reason, write your own parser.

The complicated answer is that you might be able to do this with some really ugly, hacky regex and PHP code, but to be quite honest I wouldn't advise it.
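For the DOM-parser route, a minimal sketch might look like the following. It assumes PHP's bundled DOM extension is available and that the input is plain ASCII. The function name convertBlockquotes() and the wrapper <div> are made up for illustration and are not part of the plugin. It also skips the length and character checks that the original pattern applies to rel.

```php
<?php
// Sketch only: convertBlockquotes() is a hypothetical helper, not part of the plugin.
function convertBlockquotes(string $html): string
{
    $doc = new DOMDocument();
    // Forum markup is rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The wrapper <div> keeps leading text from being re-parented by the HTML parser.
    // Plain ASCII input is assumed; other encodings need an explicit charset hint.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Grab the wrapper before any new <div> elements are created.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    $xpath = new DOMXPath($doc);
    // The XPath result is a snapshot, so restructuring nodes while iterating is safe.
    foreach ($xpath->query('//blockquote[@rel]') as $quote) {
        $author = $doc->createElement('div');
        $author->setAttribute('class', 'b');
        $author->appendChild($doc->createTextNode($quote->getAttribute('rel')));

        $body = $doc->createElement('div');
        $body->setAttribute('class', 'c');
        $paragraph = $doc->createElement('p');
        $body->appendChild($paragraph);

        // Move the existing children (text and nested blockquotes alike) into the <p>.
        while ($quote->firstChild !== null) {
            $paragraph->appendChild($quote->firstChild);
        }

        $quote->removeAttribute('rel');
        $quote->setAttribute('class', 'a');
        $quote->appendChild($author);
        $quote->appendChild($body);
    }

    // Serialize only the wrapper's children, dropping html/body and the wrapper itself.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo convertBlockquotes('<blockquote rel="blah">foo <blockquote rel="bloop">bar</blockquote></blockquote>');
```

Because nested blockquotes are moved as nodes rather than re-matched as text, nesting depth doesn't matter here. The serializer may insert line breaks between block-level elements, so the output is not guaranteed to be byte-identical to the regex version.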

See also: The Chomsky hierarchy.

eldarerathis
A: 

There's no direct support for recursive substitutions, and preg_replace_callback() isn't particularly useful in this case. But there's nothing stopping you doing the substitution in multiple passes. The first pass takes care of the outermost tags, and subsequent passes work their way inward. The optional $count argument tells you how many replacements were performed in each pass; when it comes up zero, you're done.

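// Match one balanced <BQ rel="...">...</BQ> element; (?R) recurses into nested ones.
$regex = '~(<BQ rel="([^"]++)">)((?:(?:(?!</?+BQ\b).)++|(?R))*+)(</BQ>)~s';
$sub = '<BQ class="a"><div class="b">$2</div><div class="c"><p>$3</p></div></BQ>';
// Each pass rewrites the outermost remaining matches; stop when a pass replaces nothing.
do {
  $s = preg_replace($regex, $sub, $s, -1, $count);
} while ($count != 0);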

See it in action on ideone.com
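For reference, here is a self-contained sketch of the same multi-pass loop with the tag name written out as blockquote rather than BQ. The function name replaceNestedQuotes() and the sample input are made up for illustration. The null check is an addition, since preg_replace() returns null when PCRE hits an error such as a backtrack limit.

```php
<?php
// Sketch only: replaceNestedQuotes() is a hypothetical wrapper around the loop above.
function replaceNestedQuotes(string $s): string
{
    // Group 2 is the rel value, group 3 the balanced body; (?R) handles nested quotes.
    $regex = '~(<blockquote rel="([^"]++)">)((?:(?:(?!</?+blockquote\b).)++|(?R))*+)(</blockquote>)~s';
    $sub = '<blockquote class="a"><div class="b">$2</div><div class="c"><p>$3</p></div></blockquote>';

    do {
        $result = preg_replace($regex, $sub, $s, -1, $count);
        if ($result === null) {
            // PCRE error: keep the last good string instead of returning null.
            break;
        }
        $s = $result;
        // Rewritten tags carry class="a" instead of rel, so they never match again.
    } while ($count !== 0);

    return $s;
}

$input = '<blockquote rel="blah">foo <blockquote rel="bloop">bar</blockquote></blockquote>';
echo replaceNestedQuotes($input);
```

With this input the first pass rewrites the outer quote, the second pass rewrites the inner one, and the third pass finds nothing and ends the loop.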

Alan Moore