I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup>. This leaves tokens that are either tags (including any already-superscripted ®) or plain text. Then, in each text token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
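foreach ($tokens as &$token)
{
if ($token[0] == "<") continue; // Skip tokens that are tags (this also skips an existing <sup>®</sup>)
$token = str_replace('®', '<sup>®</sup>', $token); // wrap every bare ® in this text token
}
unset($token); // break the reference left over from the foreach
$tokens = join("", $tokens); // reassemble the string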
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach: if the input isn't formatted as expected (for example, a > inside an attribute value, or a tag that spans several lines), it won't be tokenized the way you'd like (again, regular expressions are not good for HTML parsing ;) )
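To make that caveat concrete, here is a minimal sketch that wraps the steps above in a function and runs it on one well-formed input and one that trips it up. The function name superscript_registered is hypothetical (it is not part of the code above), and the file is assumed to be saved as UTF-8 so that the ® in the pattern and in the input are the same bytes.

```php
<?php
// Hypothetical helper wrapping the split / replace / join steps shown above.
function superscript_registered(string $html): string
{
    // Split on tags and on already-superscripted marks, keeping the delimiters.
    $tokens = preg_split(
        '/(<sup>®<\/sup>|<.*?>)/i',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    foreach ($tokens as $i => $token) {
        if ($token[0] === '<') {
            continue; // leave tags and an existing <sup>®</sup> untouched
        }
        $tokens[$i] = str_replace('®', '<sup>®</sup>', $token);
    }

    return implode('', $tokens);
}

// Well-formed input: only the bare ® in the text is wrapped.
echo superscript_registered('<p>Foo® <img alt="bar®" /></p>'), "\n";
// prints: <p>Foo<sup>®</sup> <img alt="bar®" /></p>

// A ">" inside an attribute value ends the tag match early, so the rest of
// the attribute is treated as text and its ® gets wrapped as well.
echo superscript_registered('<img alt="a > b®" />'), "\n";
// prints: <img alt="a > b<sup>®</sup>" />
```

The second call shows the kind of input where this approach produces broken markup; if such input is possible, a real HTML parser is the safer choice.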