ansaurus

Question

Regular Expression for nested tags (Wikimedia content)

Answer 1

+2 A:

Make your wildcard non-greedy by appending a question mark:

(\[\[)(?<category>.*?[^\]#])([\]])

                    ^
                    Here is the edit

That will make it match the individual categories.
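To see why the single question mark matters, here is a minimal sketch in Python. The sample input is an assumption (the question's actual text isn't reproduced here), and Python spells named groups `(?P<category>...)` rather than `(?<category>...)`; otherwise the two patterns are the original and the edited one.

```python
import re

# Hypothetical sample input; the question's real text is not shown in this thread.
text = "[[Category:Foo]], [[Category:Bar]]"

# Same pattern with and without the "?" after ".*".
# Python uses (?P<name>...) where the answer's flavor uses (?<name>...).
greedy = re.compile(r"(\[\[)(?P<category>.*[^\]#])([\]])")
lazy = re.compile(r"(\[\[)(?P<category>.*?[^\]#])([\]])")

def categories(pattern, s):
    """Return the text captured by the 'category' group for every match."""
    return [m.group("category") for m in pattern.finditer(s)]

greedy_result = categories(greedy, text)
lazy_result = categories(lazy, text)

print(greedy_result)  # ['Category:Foo]], [[Category:Bar']
print(lazy_result)    # ['Category:Foo', 'Category:Bar']
```

The greedy version runs to the end of the line and backtracks only as far as the last closing bracket, so both links end up in one match. The non-greedy version stops at the first place the rest of the pattern can succeed.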

RichieHindle 2009-07-22 22:53:36

I was never a fan of non-greedy matching - I usually prefer to specify what it is I don't want in my match - but +1 for the easy fix.

Chris Lutz 2009-07-22 22:58:56

Non-greedy quantifiers are the silver bullet of regexes. Somebody asks a regex question, someone else tells them to use reluctant quantifiers, it works, everyone's happy. And neither of them has any idea *why* it worked.

Alan Moore 2009-07-23 01:54:31

Answer 2

+3 A:

The problem is not the comma, the problem is that .* will match "]][[" just as well as anything else. * is greedy - it will match as much as it possibly can. To fix it, you could use the non-greedy version (as suggested by RichieHindle), or you could change .* to [^\]]* - greedy match anything except closing brackets. That should also do the trick.

Also, these are not "nested" tags - that would be [[tag [[inside]] tag]]. That's probably not what you want, as I don't think that means anything in Wikimedia markup.
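As a rough illustration of the second option, this Python sketch swaps `.*` for `[^\]]*` in the original pattern. The sample strings are assumptions, and the named group is written `(?P<category>...)` because that is how Python spells it.

```python
import re

# Original pattern with ".*" replaced by "[^\]]*" (greedy, but cannot cross a "]").
negated = re.compile(r"(\[\[)(?P<category>[^\]]*[^\]#])([\]])")

def find_categories(s):
    """Return the text captured by the 'category' group for every match."""
    return [m.group("category") for m in negated.finditer(s)]

# Hypothetical inputs: two adjacent links, then the "nested" shape from the answer.
flat = find_categories("[[Category:Foo]], [[Category:Bar]]")
nested = find_categories("[[tag [[inside]] tag]]")

print(flat)    # ['Category:Foo', 'Category:Bar']
print(nested)  # ['tag [[inside']
```

The second result shows that this pattern makes no attempt to handle real nesting: it simply runs from the first opening pair to the first closing bracket.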

Chris Lutz 2009-07-22 22:57:50

Answer 3

A:

I think you're making this a lot more complicated than it needs to be. Does this do what you want?

\[\[(?<category>[^\[\]]+)\]\]
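A quick way to try this one out, sketched in Python under the assumption that the input looks like the samples below (they are made up for illustration). The only change to the pattern is Python's `(?P<category>...)` spelling of the named group.

```python
import re

# Brackets of either kind are excluded from the link text, and "+" requires
# at least one character between the bracket pairs.
simple = re.compile(r"\[\[(?P<category>[^\[\]]+)\]\]")

def find_categories(s):
    """Return the text captured by the 'category' group for every match."""
    return [m.group("category") for m in simple.finditer(s)]

flat = find_categories("[[Category:Foo]], [[Category:Bar]]")
empty = find_categories("[[]]")
nested = find_categories("[[tag [[inside]] tag]]")

print(flat)    # ['Category:Foo', 'Category:Bar']
print(empty)   # []
print(nested)  # ['inside']
```

Because `[` is excluded as well as `]`, a stray opening pair can't be swallowed into the link text; on the nested-looking input only the innermost link matches.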

Alan Moore 2009-07-23 01:42:41

Answer 4

A:

The comma isn't relevant at all. You could have confirmed that yourself with a simple test.

And there's no nesting involved here. Wikilinks aren't allowed to be nested anyway.

You need to ensure that your inner pattern can't match the double bracket that closes a wikilink. That way, as soon as the regex encounters a double-bracket sequence, it stops accumulating characters into the match. The problem in your regular expression is that .* matches everything, closing brackets included. The easy way to fix that is to use a non-greedy quantifier, so the match is terminated as soon as possible. If you don't want to do that, or your regex library doesn't support it, then you need to explicitly exclude the sequence that should terminate the pattern.

A naïve approach would be to simply exclude closing brackets altogether: [^]]*. That's not good enough, though. A single closing bracket is allowed in a wikilink's text. Therefore, you need to accept a single bracket while excluding double brackets. This should do it:

\[\[       # 2 opening brackets
(?<category>
  (
    ]?     # optional bracket
    [^]]   # always a non-bracket
  )*
)
]]         # 2 closing brackets

That will accept a right bracket, but only if it's followed by a non-bracket to break the closing sequence.
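Here is how that pattern might look in Python, as a sketch rather than a drop-in: `re.VERBOSE` stands in for whatever extended-syntax flag your library offers, the named group uses Python's `(?P<category>...)` spelling, the brackets are backslash-escaped for readability, and the inner group is made non-capturing. The sample input is invented for the example.

```python
import re

wikilink = re.compile(r"""
    \[\[            # 2 opening brackets
    (?P<category>
      (?:
        \]?         # optional bracket
        [^\]]       # always a non-bracket
      )*
    )
    \]\]            # 2 closing brackets
""", re.VERBOSE)

# The naive alternative that excludes closing brackets altogether.
naive = re.compile(r"\[\[(?P<category>[^\]]*)\]\]")

def find_categories(pattern, s):
    """Return the text captured by the 'category' group for every match."""
    return [m.group("category") for m in pattern.finditer(s)]

# Hypothetical input: the first link contains a single "]" in its text.
text = "[[a]b]] and [[Category:Foo]]"

tolerant_result = find_categories(wikilink, text)
naive_result = find_categories(naive, text)

print(tolerant_result)  # ['a]b', 'Category:Foo']
print(naive_result)     # ['Category:Foo']
```

The naive pattern skips the first link entirely because it can't get past the lone bracket, while the bracket-then-non-bracket loop accepts it and still stops at the closing pair.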

Rob Kennedy 2009-07-23 02:20:48
