In JavaScript, one popular use of regular expressions is stripping HTML tags from text. The code for that is:

String.prototype.stripHTML = function () {
    var reTag = /<(?:.|\s)*?>/g;
    return this.replace(reTag, "");
};

If you call "<b>This would be bold</b>".stripHTML(), it returns "This would be bold". Shouldn't it return ""?

Doesn't this regex say to match everything that starts with < and ends with >? Why didn't it start at the < of <b> and end at the > of </b>?

+1  A: 

It's not a greedy regex: the *? quantifier stops at the first > it comes across, so <b> and </b> are separate matches.
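
A minimal sketch to check this, using the regex from the question unchanged. The variable names `input` and `matches` are only illustrative. Listing the matches shows two separate tags rather than one span covering the whole string:

```javascript
var input = "<b>This would be bold</b>";
var reTag = /<(?:.|\s)*?>/g;

// With the g flag, match() returns every non-overlapping match.
var matches = input.match(reTag);
console.log(matches); // two separate matches: "<b>" and "</b>"
```

Each tag is matched and removed on its own, which is why the text between the tags survives the replace.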

Nick Craver
+3  A: 

You are using a non-greedy (lazy) quantifier.

(?:.|\s)*?
         ^

This makes the match as short as possible, instead of the default behavior, which is to match as much as possible.

<b>This would be bold</b>
^-^                  ^--^     Non-greedy: <(?:.|\s)*?>
^-----------------------^     Greedy    : <(?:.|\s)*>
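
The spans in the diagram can be reproduced with a small sketch. The helper `spans` is hypothetical (not part of the question's code); it collects the start and end index of every match using `RegExp.prototype.exec`:

```javascript
// Hypothetical helper: returns [startIndex, endIndex] for each match of a global regex.
function spans(re, text) {
    var out = [];
    var m;
    re.lastIndex = 0; // start scanning from the beginning
    while ((m = re.exec(text)) !== null) {
        out.push([m.index, m.index + m[0].length - 1]);
    }
    return out;
}

var input = "<b>This would be bold</b>";
var lazySpans = spans(/<(?:.|\s)*?>/g, input);  // [[0, 2], [21, 24]]
var greedySpans = spans(/<(?:.|\s)*>/g, input); // [[0, 24]]
console.log(lazySpans, greedySpans);
```

The non-greedy pattern yields two short spans (one per tag), while the greedy pattern yields a single span covering all 25 characters.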
Mark Byers
+2  A: 

Yes, but *? performs a non-greedy match (the shortest possible):

var reTag = /<(?:.|\s)*?>/g; 

To perform a greedy match (the longest possible), remove the ? that follows the * (not the ?: that marks the non-capturing group):

var reTag = /<(?:.|\s)*>/g; 
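
As a rough comparison of what each variant does inside replace(), here is a sketch. The second test string `mixed` is an invented example, not from the question:

```javascript
var lazy = /<(?:.|\s)*?>/g;
var greedy = /<(?:.|\s)*>/g;

var original = "<b>This would be bold</b>";
var lazyResult = original.replace(lazy, "");     // "This would be bold"
var greedyResult = original.replace(greedy, ""); // ""

// Invented example with text outside the tags.
var mixed = "Hello <b>bold</b> world";
var mixedLazy = mixed.replace(lazy, "");     // "Hello bold world"
var mixedGreedy = mixed.replace(greedy, ""); // "Hello  world" (the word "bold" is lost)

console.log(lazyResult, greedyResult, mixedLazy, mixedGreedy);
```

The greedy version gives the "" the question expected, but it also deletes any text sitting between the first < and the last >, so it is probably not what you want for stripping tags.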
aularon