views:

1226

answers:

6

I need a regular expression to strip out any BBCode in a string. I've got the following (and an array with tags):

new RegExp('\\[' + tags[index] + '](.*?)\\[/' + tags[index] + ']');

It picks up [tag]this[/tag] just fine, but fails when using [url=http://google.com]this[/url].

What do I need to change? Thanks a lot.

+1  A: 

You have to allow any character other than ']' after the tag name until you reach the closing ']'.

new RegExp('\\[' + tags[index] + '[^]]*](.*?)\\[/' + tags[index] + ']');

You could simplify this to the following expression.

\[[^]]*]([^[]*)\[/[^]]*]

The problem is that it will also match [WrongTag]stuff[/WrongTag]. Matching nested tags requires applying the expression multiple times.
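One caveat if this is run in JavaScript, which the `new RegExp(...)` syntax in the question suggests: there `[^]` is already a complete character class that matches any character, so `[^]]` is not read as "anything but ']'". In flavours such as POSIX or PCRE a leading ']' is taken literally and the unescaped form works as described. The sketch below compares both spellings in JavaScript; the sample string and variable names are made up for illustration.

```javascript
// Hypothetical sample input, used only for illustration.
var sample = '[url=http://google.com]this[/url]';

// Unescaped: JavaScript parses this as [^] (any character), then ]*, then ].
var unescaped = new RegExp('\\[url[^]]*](.*?)\\[/url]');

// Escaped: [^\]] is a negated class that excludes only ']'.
var escaped = new RegExp('\\[url[^\\]]*](.*?)\\[/url]');

var unescapedMatch = unescaped.exec(sample); // null in JavaScript
var escapedMatch = escaped.exec(sample);     // matches; group 1 is 'this'

console.log(unescapedMatch);
console.log(escapedMatch && escapedMatch[1]);
```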

Daniel Brückner
Why should you be at all interested in tag nesting when your goal is to take out any BBcode tags anyway?
Tomalak
[^]] needs escaping to [^\\]]
Question Mark
A: 

I think

new RegExp('\\[' + tags[index] + '(=[^\\]]+)?](.*?)\\[/' + tags[index] + ']');

should do it. The tag content is then in group 2 instead of group 1.
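As a rough sketch of how that could be used in JavaScript: the helper name `unwrapTag` and the sample text are assumptions for illustration, and tag names are assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper: replaces each pair of the given tag with its inner text.
function unwrapTag(text, tag) {
  // Group 1: optional "=value" attribute; group 2: the text between the tags.
  var re = new RegExp('\\[' + tag + '(=[^\\]]+)?](.*?)\\[/' + tag + ']', 'g');
  return text.replace(re, '$2');
}

var unwrapped = unwrapTag('See [url=http://google.com]this[/url] and [url]that[/url].', 'url');
console.log(unwrapped); // See this and that.
```

When the attribute is absent, group 1 simply does not participate in the match, so `$2` is the content in both cases.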

rudolfson
[^\\]] does not match characters other than ']' but characters other than '\' followed by ']' because you must not escape ']' in the first position. Correct is [^]].
Daniel Brückner
+2  A: 

To strip out any BBCode, use something like:

string alltags = tags.Join("|");
RegExp stripbb = new RegExp('\\[/?(' + alltags + ')[^]]*\\]');

Replace globally with the empty string. No extra loop necessary.
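A minimal runnable sketch of the same idea in JavaScript, assuming a small tag list (the question does not say which tags are in the array). Here the ']' inside the character class is escaped, because JavaScript treats `[^]` as a complete class on its own.

```javascript
// Assumed tag list; the question only says the tags are held in an array.
var tags = ['b', 'i', 'u', 'url', 'quote'];

// Matches an opening or closing tag, with an optional "=value" part.
// The ']' inside the character class is escaped for JavaScript.
var stripbb = new RegExp('\\[/?(?:' + tags.join('|') + ')[^\\]]*\\]', 'g');

var stripped = '[b][url=http://google.com]Google[/url][/b]'.replace(stripbb, '');
console.log(stripped); // Google
```

Note that anything up to the next ']' is accepted after the tag name, so `[bold]` would also be removed via the `b` alternative; adding `\\b` right after the group is one way to tighten that.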

Tomalak
[^\\]] does not match characters other than ']' but characters other than '\' followed by ']' because you must not escape ']' in the first position. Correct is [^]].
Daniel Brückner
There is no "followed by" in a character class. If anything, the character class matches everything except "\" and "]". I'll take out the surplus backslash.
Tomalak
+1  A: 

You can check for balanced tags using a backreference:

 new RegExp('\\[(' + tags.Join('|') + ')[^]]*](.*?)\\[/\\1]');

The real problem is that you can't match arbitrarily nested tags with a regular expression (that's the limit of a regular language). Some languages do allow recursive regular expressions, but those are extensions (which technically make them non-regular, though that doesn't change the name most people use for them).
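One way around the nesting limit, sketched here for JavaScript under some assumptions, is to apply the backreference pattern repeatedly until the string stops changing. The tag list and the helper name `unwrapBalanced` are hypothetical, and the ']' in the character class is escaped as JavaScript requires.

```javascript
// Assumed tag list, for illustration only.
var tagNames = ['b', 'i', 'url'];

// \1 requires the closing tag to repeat the name captured by group 1.
// [\s\S] is used instead of '.' so the content may span several lines.
var pair = new RegExp(
  '\\[(' + tagNames.join('|') + ')(?:=[^\\]]*)?]([\\s\\S]*?)\\[/\\1]', 'g');

// Hypothetical helper: unwrap balanced pairs until nothing changes.
function unwrapBalanced(text) {
  var previous;
  do {
    previous = text;
    text = text.replace(pair, '$2');
  } while (text !== previous);
  return text;
}

console.log(unwrapBalanced('[b][i]nested[/i][/b] and [b]unclosed'));
// nested and [b]unclosed
```

Each pass peels off one level of balanced tags; an opening tag without a matching close is left in place.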

If you don't care about balanced tags, you can just strip out any tag you find:

 new RegExp('\\[/?(?:' + tags.Join('|') + ')[^]]*]');
rampion
Balancing tags is totally irrelevant here. The OP wants the tags removed, not matched.
Tomalak
A: 

I came across this thread and found it helpful for getting on the right track. Here's the one I ended up with after two hours of work (it's my first RegEx!). It's for JavaScript, and in my tests it worked very well for deeply nested and even incorrectly nested strings:

string = string.replace(/\[\/?(?:b|i|u|url|quote|code|img|color|size)*?.*?\]/img, '');

If string = "[b][color=blue][url=www.google.com]Google[/url][/color][/b]" then the new string will be "Google". Amazing.
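One thing worth knowing about this pattern: because the tag group is followed by `*?` and then `.*?`, the tag name is effectively optional, so any bracketed text on a single line is removed, not just the listed tags. The comparison below is a sketch; the stricter variant and the sample input are assumptions for illustration, not part of the original post.

```javascript
// The pattern from the post, unchanged.
var originalPattern = /\[\/?(?:b|i|u|url|quote|code|img|color|size)*?.*?\]/img;

// Stricter variant (an assumption): the tag name is required, must end at a
// word boundary, and only non-']' characters may follow it.
var stricterPattern = /\[\/?(?:b|i|u|url|quote|code|img|color|size)\b[^\]]*\]/gi;

// Hypothetical input mixing a non-BBCode bracket with real tags.
var input = 'array[0] is [b]bold[/b]';

var looseResult = input.replace(originalPattern, '');
var strictResult = input.replace(stricterPattern, '');

console.log(looseResult);  // array is bold
console.log(strictResult); // array[0] is bold
```

If stripping every bracketed run is acceptable for your input, the original pattern is fine as it is.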

Hope someone finds that useful; this was a top match for 'JavaScript RegEx strip BBCode' in Google ;)

A: 

Remember that many (most?) regex flavours by default do not let the DOT meta character match line terminators. This causes a tag like

"[foo]dsdfs
fdsfsd[/foo]"

to fail. Either enable DOTALL by adding "(?s)" to your regex, or replace the DOT meta char in your regex by the character class [\S\s].
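For JavaScript specifically, the DOTALL switch is normally spelled as the `s` flag (added in ES2018) rather than a leading "(?s)". The sketch below shows the default behaviour and both workarounds on the example above; the variable names are made up.

```javascript
var multiline = '[foo]dsdfs\nfdsfsd[/foo]';

// '.' does not match '\n' by default, so this pattern finds nothing.
var dotPattern = /\[foo](.*?)\[\/foo]/;

// [\s\S] matches any character, including line terminators.
var classPattern = /\[foo]([\s\S]*?)\[\/foo]/;

// The 's' (dotAll) flag makes '.' match line terminators as well.
var dotAllPattern = new RegExp('\\[foo](.*?)\\[/foo]', 's');

console.log(dotPattern.test(multiline));    // false
console.log(classPattern.test(multiline));  // true
console.log(dotAllPattern.test(multiline)); // true
```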

Bart Kiers