I am writing a BBCode-to-HTML converter.
The converter should skip unclosed tags.

I thought of two ways to do it:
1) Match all tags at once with a single regex call, like:

Regex re2 = new Regex(@"\[(/?(?:b|i|u|quote|strike))\]");
MatchCollection mc = re2.Matches(sourcestring);

and then loop over the MatchCollection with two pointers to find matching opening and closing tags, replacing each pair with the right HTML tags.
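
A minimal sketch of how option 1 could look, assuming a list used as a stack in place of the two pointers. The names BbCodePairing and ToHtml, and the mapping from BBCode names to HTML elements, are assumptions made for illustration only. A closing tag is paired with the nearest open tag of the same name; anything left unpaired stays in the text unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class BbCodePairing
{
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static readonly Regex TagRegex =
        new Regex(@"\[(/?)(b|i|u|quote|strike)\]");

    // Hypothetical BBCode-to-HTML element mapping (an assumption).
    private static readonly Dictionary<string, string> HtmlNames =
        new Dictionary<string, string>
        {
            { "b", "b" }, { "i", "i" }, { "u", "u" },
            { "quote", "blockquote" }, { "strike", "s" }
        };

    public static string ToHtml(string source)
    {
        MatchCollection matches = TagRegex.Matches(source);
        // replacement[k] stays null when tag k has no partner.
        string[] replacement = new string[matches.Count];
        var open = new List<int>(); // indexes of opening tags not yet closed

        for (int k = 0; k < matches.Count; k++)
        {
            string name = matches[k].Groups[2].Value;
            if (matches[k].Groups[1].Length == 0)
            {
                open.Add(k); // opening tag: remember it
                continue;
            }
            // Closing tag: find the nearest open tag with the same name.
            int pos = open.FindLastIndex(o => matches[o].Groups[2].Value == name);
            if (pos < 0) continue; // no partner, leave the closing tag as text
            replacement[open[pos]] = "<" + HtmlNames[name] + ">";
            replacement[k] = "</" + HtmlNames[name] + ">";
            // Drop the pair and any unclosed tags nested inside it.
            open.RemoveRange(pos, open.Count - pos);
        }

        // Rebuild the string by position, copying the text between tags untouched.
        var sb = new StringBuilder(source.Length);
        int last = 0;
        for (int k = 0; k < matches.Count; k++)
        {
            if (replacement[k] == null) continue; // unpaired tag stays as-is
            sb.Append(source, last, matches[k].Index - last);
            sb.Append(replacement[k]);
            last = matches[k].Index + matches[k].Length;
        }
        sb.Append(source, last, source.Length - last);
        return sb.ToString();
    }
}
```

Under these assumptions, BbCodePairing.ToHtml("[b][b]x[/b]") leaves the first, unclosed [b] in place and returns [b]<b>x</b>. Because the content is never matched by the regex, line breaks between tags need no special handling.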

2) Call a separate regex for every tag and replace directly:

Regex re = new Regex(@"\[b\](.*?)\[\/b\]"); 
string s1 = re.Replace(sourcestring2,"<b>$1</b>");

Which is more efficient?

The first option uses one regex, but requires me to loop through all tags, find all pairs, and skip tags that don't have a pair.
Another advantage is that I don't care about the content between the tags; I just replace the tags by position.

With the second option I don't need to worry about looping or writing a special replace function.
But it requires executing multiple regexes and replacements.

What would you suggest?

If the second option is the right one, there is a problem with the regex \[b\](.*?)\[\/b\]:

how can I fix it so that it also matches across multiple lines, like:

[b]
        test 1
[/b]

[b]
        test 2
[/b]
+1  A: 
var r = new System.Text.RegularExpressions.Regex(@"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);

var s = r.Replace("asdfasdf[b]test[/b]asdfsadf", "<b>$1</b>");

That should give you only the elements that have a matching closing tag, and it also handles multi-line content: despite its name, RegexOptions.Singleline makes . match newline characters as well, so the whole input is effectively treated as a single line.
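
To show the effect of that option in isolation, here is a small sketch that applies it to the simpler pattern from the question. The names SinglelineDemo and ReplaceBold are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    public static string ReplaceBold(string source)
    {
        // With RegexOptions.Singleline, "." also matches "\n",
        // so the lazy "(.*?)" can span line breaks.
        var re = new Regex(@"\[b\](.*?)\[/b\]", RegexOptions.Singleline);
        return re.Replace(source, "<b>$1</b>");
    }
}
```

Unlike the balancing-group pattern, this simple version does not track nesting: it pairs each [b] with the first [/b] that follows it.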

It should also handle [b][b][/b] properly by ignoring the first [b].

As to whether or not this method is better than your first method I couldn't say. But hopefully this will point you in the right direction.

Code that works with your example below:

System.Text.RegularExpressions.Regex r;

r = new System.Text.RegularExpressions.Regex(@"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);

var s = r.Replace("[b]bla bla[/b]bla bla[b] " + "\r\n" + "bla bla [/b]", "<b>$1</b>");
BuildStarted
It is not working on a string like: "[b]bla bla[/b]bla bla[b] " + "\r\n" + "bla bla [/b]";
ilann
I tested it with that exact code and it worked just fine, so I'm not quite sure what the problem could be. But if you could copy the code above and check that :) Edit: moving the code into my answer.
BuildStarted
+1  A: 

One option would be to use more SAX-like parsing: instead of looking for a particular regex, you look for [, have your program handle that event in some manner, then look for the ], handle that event, and so on. Although more verbose than the regex, it may be easier to understand, and it wouldn't necessarily be slower.
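
A rough sketch of what such a scanner might look like, to make the idea concrete. The type and member names (BbCodeScanner, Kind, Scan) are hypothetical, and it assumes a tag name consists of letters only. It reports text, opening-tag and closing-tag events in order; pairing the tags and writing the HTML would be done by whatever consumes those events.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BbCodeScanner
{
    // Hypothetical event kinds for illustration.
    public enum Kind { Text, Open, Close }

    public static IEnumerable<(Kind Kind, string Value)> Scan(string source)
    {
        int pos = 0;
        while (pos < source.Length)
        {
            int lb = source.IndexOf('[', pos);
            int rb = lb < 0 ? -1 : source.IndexOf(']', lb + 1);
            if (rb < 0)
            {
                // No complete "[...]" remains: the rest is plain text.
                yield return (Kind.Text, source.Substring(pos));
                yield break;
            }
            if (lb > pos)
                yield return (Kind.Text, source.Substring(pos, lb - pos));

            string inner = source.Substring(lb + 1, rb - lb - 1);
            bool closing = inner.StartsWith("/");
            string name = closing ? inner.Substring(1) : inner;
            if (name.Length > 0 && name.All(char.IsLetter))
            {
                yield return (closing ? Kind.Close : Kind.Open, name);
                pos = rb + 1;
            }
            else
            {
                // Not a tag: report the "[" as text and rescan after it.
                yield return (Kind.Text, "[");
                pos = lb + 1;
            }
        }
    }
}
```

A caller could iterate with foreach (var token in BbCodeScanner.Scan(input)) and keep a stack of open tags to decide which ones have a matching close.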

nearlymonolith