I don't know C#, but here's a regex:
/\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/
This will match [a ...][tag1][tag2][...][tagN]text[/tagN]...[/tag2][/tag1][/a] and capture text.
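As a quick sanity check, here is a minimal sketch of the pattern in use. It is written in Python purely for illustration (not C#), and the sample string and variable names are made up for the example.

```python
import re

# Same pattern as above, minus the /.../ delimiters. The raw string (r"...")
# keeps the backslashes from being interpreted by Python itself.
LINK = re.compile(r"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]")

# Hypothetical input: a link whose text is wrapped in two nested tags.
sample = '[a href="http://example.com"][b][i]click here[/i][/b][/a]'

match = LINK.search(sample)
# Group 1 is the single capturing group, i.e. the link text.
link_text = match.group(1) if match else None
print(link_text)  # click here
```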
To explain:
- The /.../ are common regex delimiters (like double quotes for strings). C# may just use strings to initialize regexes, in which case the forward slashes aren't necessary.
- \[ and \] match a literal [ and ] character. We need to escape them with a backslash since square brackets have a special meaning in regexes.
- [^\]] is an example of a character class, here meaning any character that is not a close square bracket. The square brackets delimit the character class, the caret (^) denotes negation, and the escaped close square bracket is the character being negated.
- * and + are suffixes meaning match 0 or more and 1 or more of the previous pattern, respectively. So [^\]]* means match 0 or more of anything except a close square bracket.
- \s is a shorthand for the character class of whitespace characters.
- (?:...) groups its contents into a single unit without capturing them (a non-capturing group), so that a suffix such as * applies to the whole group.
- (...) groups like (?:...) does, but also saves the substring that this portion of the regex matches into a variable. This is normally called a capture, since it captures this portion of the string for you to use later. Here, we are using a capture to grab the link text.
- . matches any single character (by default, any character except a newline).
- *? is a suffix for non-greedy matching. Normally, the * suffix is greedy, and matches as much as it can while still allowing the rest of the pattern to match something. *? is the opposite: it matches as little as it can while still allowing the rest of the pattern to match something. The reason we use *? here instead of * is so that if we have multiple [/a]s on a line, we only go as far as the next one when matching link text.
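To make the greedy versus non-greedy point concrete, here is a small sketch (Python again, for illustration only). It uses a cut-down version of the pattern without the optional tag groups, and the input line is invented, so treat it as a demonstration of the two suffixes rather than the full regex.

```python
import re

# Hypothetical line with two links, so there are two [/a]s to choose from.
line = "[a href=1]first[/a] and [a href=2]second[/a]"

# Greedy: .* runs to the LAST [/a] on the line.
greedy = re.search(r"\[a\s+[^\]]*\](.*)\[\/a\]", line).group(1)

# Non-greedy: .*? stops at the NEXT [/a].
lazy = re.search(r"\[a\s+[^\]]*\](.*?)\[\/a\]", line).group(1)

print(greedy)  # first[/a] and [a href=2]second
print(lazy)    # first
```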
This will only remove [tag]s that come at the beginning and end of the text. To remove any that come in the middle of the text (like [a href=""]a [b]big[/b] frog[/a]), you'll need to do a second pass on the capture from the first, scrubbing out any text that matches:
/\[[^\]]+\]/
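The two passes could be wired together roughly like this. It is a sketch in Python rather than C#, and the helper name link_text and the constants LINK and TAG are made up for the example, not part of any library.

```python
import re

# First pass: find the link and capture its text (tags at the edges are skipped).
LINK = re.compile(r"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]")
# Second pass: any remaining [tag] or [/tag] inside the captured text.
TAG = re.compile(r"\[[^\]]+\]")


def link_text(s):
    """Return the link text with all bracketed tags removed, or None if no link."""
    m = LINK.search(s)
    if m is None:
        return None
    # Scrub out tags that sit in the middle of the captured text.
    return TAG.sub("", m.group(1))


result = link_text('[a href=""]a [b]big[/b] frog[/a]')
print(result)  # a big frog
```

Note that the second pass removes anything in square brackets, so link text that legitimately contains brackets would lose them too.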