I need some help with a regex.

I have some HTML output and I need to wrap every registered trademark symbol (&reg;) in <sup></sup>.

I cannot insert the <sup> tag inside title and alt attributes, and obviously I don't need to wrap the &reg; symbols that are already superscripted.

The following regex matches text that is not part of an HTML tag:

(?<=^|>)[^><]+?(?=<|$)

An example of what I'm looking for:

$original = '<div>asd&reg; asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>';

The filtered string should output:

<div>asd<sup>&reg;</sup> asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>

thanks a lot for your time!!!

+2  A: 

I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).

You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
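If the project turns out to be PHP, one parser that ships with default PHP builds is the DOM extension (DOMDocument). The following is only a minimal sketch under some assumptions: the fragment has a single root element, and the function name wrapRegInTextNodes is made up for illustration. It walks the text nodes, skips any that already sit inside a <sup>, and never touches attribute values.

```php
<?php
// Hypothetical helper (not from the thread): wrap bare ® characters found in
// text nodes with <sup>, leaving attributes and existing <sup> content alone.
function wrapRegInTextNodes(string $html): string
{
    $reg = "\u{00AE}"; // the character that &reg; decodes to

    $doc = new DOMDocument();
    // These flags stop libxml from adding <html>/<body> wrappers and a doctype.
    // Assumption: $html is a fragment with exactly one root element.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($doc);
    // Text nodes containing ® that are not already inside a <sup>.
    $query = '//text()[contains(., "' . $reg . '") and not(ancestor::sup)]';

    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $fragment = $doc->createDocumentFragment();
        foreach (explode($reg, $node->nodeValue) as $i => $part) {
            if ($i > 0) {
                // Every split point corresponds to one ® in the original text.
                $sup = $doc->createElement('sup');
                $sup->appendChild($doc->createTextNode($reg));
                $fragment->appendChild($sup);
            }
            if ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    return $doc->saveHTML();
}
```

The serializer may write the symbol back as an entity or as a literal character depending on the libxml version and document encoding, so it is safer to compare the structure of the output than its exact bytes.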

Brian Agnew
Got your point... ;) I'm working on a Drupal (PHP) project, and I must use only "out of the box" stuff. That's why I'm looking for a regex, so I can use it as the pattern in a preg_replace... :/
Wil
A: 

Regex alone is not enough for what you want. First you must write code to identify whether a piece of content is the value of an attribute or a text node of an element. Then you must loop through all the text-node content and apply some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:

content[i] = content[i].replace(/&reg;/g, "<sup>&reg;</sup>");
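For what it's worth, the closest PHP counterpart to that JavaScript call is probably str_replace, which replaces every occurrence of a plain string. A small sketch, where the $content array is a hypothetical stand-in for text nodes that have already been separated from tags and attributes:

```php
<?php
// Hypothetical input: text-node strings already separated from the markup.
$content = ['asd&reg; asdasd. asd', 'asd '];

foreach ($content as $i => $text) {
    // str_replace() replaces all occurrences, like the /g flag in JavaScript.
    $content[$i] = str_replace('&reg;', '<sup>&reg;</sup>', $text);
}
```

No regex is needed for this step, since &reg; is a fixed string.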
A: 

I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.

I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>&reg;</sup>. The resulting tokens are either delimiters (a tag or an already superscripted &reg;) or plain text. Then, for each text token, &reg; can be replaced with <sup>&reg;</sup>:

$regex = '/(<sup>&reg;<\/sup>|<.*?>)/i';
$original = '<div>asd&reg; asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>';

// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
    [0] => <div>
    [1] => asd&reg; asdasd. asd
    [2] => <sup>&reg;</sup>
    [3] => asd
    [4] => <img alt="qwe&reg;qwe" />
    [5] => </div>
)
*/

foreach ($tokens as &$token)
{
    if ($token[0] == "<") continue; // Skip tokens that are tags (or an already wrapped &reg;)
    $token = str_replace('&reg;', '<sup>&reg;</sup>', $token); // wrap each bare &reg;
}
unset($token); // break the reference left behind by the foreach

$tokens = implode("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>&reg;</sup> asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>"

Note that this is a naive approach, and if the output isn't formatted as expected, it might not parse the way you'd like (again, regular expressions are not good for HTML parsing ;) )
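To make that caveat concrete, here is the same tokenizing idea wrapped in a function (the name wrapRegByTokens is only illustrative), followed by one input where it goes wrong: a literal > inside an attribute value ends the "tag" token early, so the rest of the attribute is treated as text.

```php
<?php
// Illustrative wrapper around the split-and-replace approach described above.
function wrapRegByTokens(string $html): string
{
    $regex = '/(<sup>&reg;<\/sup>|<.*?>)/i';
    $tokens = preg_split($regex, $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $i => $token) {
        if ($token[0] === '<') {
            continue; // a tag, or an already superscripted &reg;
        }
        $tokens[$i] = str_replace('&reg;', '<sup>&reg;</sup>', $token);
    }

    return implode('', $tokens);
}

// Expected input shape: works as intended.
$ok = wrapRegByTokens('<div>asd&reg; asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>');

// Edge case: the ">" inside alt="..." closes the tag token too early,
// so the &reg; in the attribute gets wrapped by mistake.
$broken = wrapRegByTokens('<img alt="a > b&reg;" />');
```

With this input, $broken ends up containing a <sup> inside the alt attribute, which is exactly the kind of result the question wanted to avoid.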

Daniel Vandersluis
+1  A: 

Well, here is a simple way, if you agree to the following limitation:

Those &reg; that are already processed have the </sup> following right after the &reg;

echo preg_replace('#&reg;(?!\s*</sup>|[^<]*>)#','<sup>&reg;</sup>', $s);

The logic behind it is:

  1. we replace only those &reg; which are not followed by </sup> and...
  2. which are not followed by a > symbol without an opening < symbol before it (i.e. the &reg; is not inside a tag)
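
The two rules can be checked against the sample from the question. This is just a sketch: the function name wrapRegWithLookahead is invented for the example, and the second call shows a case that the second rule gets wrong, namely a literal > in plain text after the &reg;.

```php
<?php
// Illustrative wrapper around the one-line preg_replace from this answer.
function wrapRegWithLookahead(string $s): string
{
    // Negative lookahead: skip &reg; when followed by </sup> (rule 1)
    // or by a ">" that comes before any "<" (rule 2, i.e. inside a tag).
    return preg_replace('#&reg;(?!\s*</sup>|[^<]*>)#', '<sup>&reg;</sup>', $s);
}

// Sample from the question: only the first &reg; is wrapped.
$sample = wrapRegWithLookahead('<div>asd&reg; asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>');

// Limitation: a bare ">" in the text makes rule 2 think we are inside a tag,
// so this &reg; is left unwrapped.
$missed = wrapRegWithLookahead('<p>A&reg; > B</p>');
```

So the one-liner works for the sample markup, but text that contains an unescaped > can cause a &reg; to be skipped.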
disjunction
Thanks a lot, guys! I'm going to take this solution for my case, but I thank you all for the suggestions. If anything else comes up, I'll let you know! Thx!!!
Wil