ansaurus

Question

Answer 1

+2 A:

I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).

You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
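For illustration, here is a minimal sketch of the parser-based approach in PHP, the platform named in the follow-up comment below. It assumes the bundled DOM extension is enabled; the function name wrapRegInSup and the sample markup are made up for the example. The parser decodes &reg; into the literal ® character, so the serialized output may contain the character instead of the entity.

```php
<?php
// Sketch only: wrapRegInSup() is a hypothetical helper, not part of PHP or Drupal.
function wrapRegInSup(string $html): string
{
    $reg = "\u{00AE}"; // the parser decodes &reg; into this character

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // Prefixing an XML declaration is a common trick to make libxml read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select text nodes that contain the symbol and are not already inside <sup>.
    // Attribute values are not text nodes, so they are never touched.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//text()[contains(., "' . $reg . '") and not(ancestor::sup)]');

    foreach (iterator_to_array($nodes) as $textNode) {
        $fragment = $doc->createDocumentFragment();
        foreach (explode($reg, $textNode->nodeValue) as $i => $part) {
            if ($i > 0) {
                // Every split point is one occurrence of the symbol.
                $fragment->appendChild($doc->createElement('sup', $reg));
            }
            if ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the <body> wrapper added above.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$sample = '<div>asd&reg; asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>';
$wrapped = wrapRegInSup($sample);
```

The attribute value and the already wrapped symbol are skipped by construction, which is the edge-case handling a regular expression has to approximate.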

Brian Agnew 2009-09-02 14:39:21

Got your point ;) I'm working on a Drupal (PHP) project, and I must use only "out of the box" stuff. That's why I'm looking for a regex, so I can use it as the pattern in a preg_replace. :/

Wil 2009-09-02 14:55:24

Answer 2

A:

Regex alone is not enough for what you want. First you must write code to identify whether a piece of content is the value of an attribute or a text node of an element. Then you must loop through all the text content and apply a replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:

content[i] = content[i].replace(/&reg;/g, "<sup>&reg;</sup>"); // replace() returns a new string, so assign it back
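A rough PHP equivalent of that call, as a sketch: it assumes the text chunks have already been separated from tags and attribute values, and the $content array here is a hypothetical stand-in for them.

```php
<?php
// Hypothetical input: text chunks already isolated from tags and attributes.
$content = ['asd&reg; asdasd.', 'no symbol here'];

foreach ($content as $i => $chunk) {
    // str_replace() returns the modified copy, so it is assigned back.
    $content[$i] = str_replace('&reg;', '<sup>&reg;</sup>', $chunk);
}
```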

2009-09-02 15:31:56

Answer 3

A:

I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.

I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>&reg;</sup>. This leaves tokens that are either a tag, an already superscripted &reg;, or plain text. Then, in each plain-text token, &reg; can be replaced with <sup>&reg;</sup>:

$regex = '/(<sup>&reg;<\/sup>|<.*?>)/i';
$original = '<div>asd&reg; asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>';

// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
    [0] => <div>
    [1] => asd&reg; asdasd. asd
    [2] => <sup>&reg;</sup>
    [3] => asd
    [4] => <img alt="qwe&reg;qwe" />
    [5] => </div>
)
*/

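foreach ($tokens as &$token)
{
    if ($token[0] == "<") continue; // Skip tokens that are tags
    // str_replace(search, replace, subject) returns the modified copy
    $token = str_replace('&reg;', '<sup>&reg;</sup>', $token);
}
unset($token); // break the reference left over from the foreach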

$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>&reg;</sup> asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>"

Note that this is a naive approach, and if the input isn't formatted as expected it might not parse the way you'd like (again, regular expressions are not good for HTML parsing ;) )
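One concrete case where the tokenizer goes wrong, as a sketch with made-up markup: a literal > inside an attribute value ends the <.*?> match early, so the rest of the attribute is treated as text.

```php
<?php
$regex = '/(<sup>&reg;<\/sup>|<.*?>)/i';
// Hypothetical input: the alt attribute contains a literal ">".
$original = '<img alt="1 > 0 &reg;" />';

$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
// The tag is cut at the first ">", so ' 0 &reg;" />' ends up as a "text" token.

foreach ($tokens as &$token) {
    if ($token[0] == "<") continue; // Skip tokens that are tags
    $token = str_replace('&reg;', '<sup>&reg;</sup>', $token);
}
unset($token);

$result = join("", $tokens);
// $result => '<img alt="1 > 0 <sup>&reg;</sup>" />' (markup injected into the attribute)
```

If the input can contain such attribute values, the tag pattern would need to account for quoted strings, or a parser is the safer choice.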

Daniel Vandersluis 2009-09-02 16:06:30

Answer 4

+1 A:

Well, here is a simple way, if you agree to the following limitation:

Those &reg; that are already processed have </sup> following right after the &reg;

echo preg_replace('#&reg;(?!\s*</sup>|[^<]*>)#','<sup>&reg;</sup>', $s);

The logic behind it is:

we replace only those &reg; which are not followed by </sup> and...
which are not followed by a > symbol without an opening < symbol before it
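The two conditions above can be checked with a short sketch. The sample string is borrowed from Answer 3, and the second input is a made-up edge case showing where the heuristic gives up.

```php
<?php
$pattern = '#&reg;(?!\s*</sup>|[^<]*>)#';

$s = '<div>asd&reg; asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>';
$result = preg_replace($pattern, '<sup>&reg;</sup>', $s);
// Only the first &reg; is wrapped: the second is followed by </sup>,
// and the third is followed by ">" with no "<" in between (it sits inside a tag).
// $result => '<div>asd<sup>&reg;</sup> asdasd. asd<sup>&reg;</sup>asd <img alt="qwe&reg;qwe" /></div>'

// Hypothetical edge case: a literal ">" in plain text after the symbol
// looks like the end of a tag, so this &reg; is left unwrapped.
$edge = preg_replace($pattern, '<sup>&reg;</sup>', '<p>5&reg; > 3</p>');
// $edge => '<p>5&reg; > 3</p>'
```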

disjunction 2009-09-02 16:41:10

Thanks a lot, guys! I'm going to take this solution for my case, but I thank you all for the suggestions. If anything else comes up, I'll let you know! Thx!

Wil 2009-09-02 17:02:56

Regex to replace reg trademark
