Haven't done regex in a while, and am a bit rusty.

I'm trying to parse the categories out of a Wikipedia entry. What I need are the individual strings contained in a pattern that starts with two open brackets and ends with two closing brackets.

This pattern works most of the time -

(\[\[)(?<category>.*[^\]#])([\]])

but has issues when the closing brackets have a comma (',') next to them.

This has the unfortunate result that when parsing the following text -

nlocation = [[Seattle, Washington]], [[United States|USA]]|

it extracts the following for "category"

Seattle, Washington]], [[United States|USA

Clearly, the comma is throwing this off and it is finding the next set. What's the best way to capture every value between open and closed double brackets?

+2  A: 

Make your wildcard non-greedy by appending a question mark:

(\[\[)(?<category>.*?[^\]#])([\]])

                    ^
                    Here is the edit

That will make it match the individual categories.
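For anyone who wants to see the difference, here is a minimal Python sketch that runs the greedy and non-greedy versions against the sample line from the question. Python spells named groups (?P<category>...) rather than (?<category>...), so the patterns are adapted accordingly; the variable names are only for illustration.

```python
import re

text = "nlocation = [[Seattle, Washington]], [[United States|USA]]|"

# Original pattern: the greedy .* runs on to the last possible closing bracket.
greedy = re.compile(r"(\[\[)(?P<category>.*[^\]#])([\]])")
# Same pattern with a lazy .*? that stops at the first possible closing bracket.
lazy = re.compile(r"(\[\[)(?P<category>.*?[^\]#])([\]])")

greedy_hits = [m.group("category") for m in greedy.finditer(text)]
lazy_hits = [m.group("category") for m in lazy.finditer(text)]

print(greedy_hits)  # one long capture spanning both links
print(lazy_hits)    # one capture per link
```
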

RichieHindle
I was never a fan of non-greedy matching - I usually prefer to specify what it is I don't want in my match - but +1 for the easy fix.
Chris Lutz
Non-greedy quantifiers are the silver bullet of regexes. Somebody asks a regex question, someone else tells them to use reluctant quantifiers, it works, everyone's happy. And neither of them has any idea *why* it worked.
Alan Moore
+3  A: 

The problem is not the comma; the problem is that .* will match "]], [[" just as well as anything else. * is greedy: it will match as much as it possibly can. To fix it, you could use the non-greedy version (as suggested by RichieHindle), or you could change .* to [^\]]*, a greedy match of anything except closing brackets. That should also do the trick.
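A rough Python sketch of both points, assuming Python's (?P<category>...) spelling for the named group and a made-up test string with no comma in it: the original pattern still over-matches, and swapping .* for [^\]]* fixes it while staying greedy.

```python
import re

# Hypothetical input with no comma between the two links.
text = "[[a]] [[b]]"

# Original pattern from the question (greedy .*).
original = re.compile(r"(\[\[)(?P<category>.*[^\]#])([\]])")
# Same pattern with .* replaced by [^\]]*, which cannot cross a closing bracket.
negated = re.compile(r"(\[\[)(?P<category>[^\]]*[^\]#])([\]])")

original_hits = [m.group("category") for m in original.finditer(text)]
negated_hits = [m.group("category") for m in negated.finditer(text)]

print(original_hits)  # still swallows both links, comma or not
print(negated_hits)   # one capture per link
```
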

Also, these are not "nested" tags - that would be [[tag [[inside]] tag]]. That's probably not what you want, as I don't think that means anything in MediaWiki markup.

Chris Lutz
A: 

I think you're making this a lot more complicated than it needs to be. Does this do what you want?

\[\[(?<category>[^\[\]]+)\]\]
Alan Moore
A: 

The comma isn't relevant at all. You could have confirmed that yourself with a simple test.

And there's no nesting involved here. Wikilinks aren't allowed to be nested anyway.

You need to ensure that your inner pattern can't match the double-bracket that closes a wikilink. That way, any time you do encounter a double-bracket sequence, it will stop accumulating more characters into the regex match. The problem in your regular expression is that .* matches everything. The easy way to fix that is to use a non-greedy modifier. That way, the match is terminated as soon as possible. If you don't want to do that or your regex library doesn't support it, though, then you need to explicitly exclude the sequence that should terminate the pattern.

A naïve approach would be to simply exclude closing brackets altogether: [^]]*. That's not good enough, though. A single closing bracket is allowed in a wikilink's text. Therefore, you need to accept a single bracket while excluding double brackets. This should do it:

\[\[       # 2 opening brackets
(?<category>
  (
    ]?     # optional bracket
    [^]]   # always a non-bracket
  )*
)
]]         # 2 closing brackets

That will accept a right bracket, but only if it's followed by a non-bracket to break the closing sequence.
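Here is a sketch of how that could look in Python, under a few assumptions: the pattern is compiled with re.VERBOSE so the whitespace and comments are ignored, the named group uses Python's (?P<category>...) spelling, the brackets are escaped for readability, and the inner group is made non-capturing. The test string with a single bracket inside a link is made up for illustration. The naive version is included for comparison.

```python
import re

strict = re.compile(r"""
    \[\[              # 2 opening brackets
    (?P<category>
      (?:
        \]?           # optional single closing bracket
        [^\]]         # always followed by a non-bracket
      )*
    )
    \]\]              # 2 closing brackets
""", re.VERBOSE)

# Naive version: excludes closing brackets altogether.
naive = re.compile(r"\[\[(?P<category>[^\]]*)\]\]")

# Hypothetical input: the first link contains a single closing bracket.
text = "[[a]b]] and [[Seattle, Washington]], [[United States|USA]]"

strict_hits = [m.group("category") for m in strict.finditer(text)]
naive_hits = [m.group("category") for m in naive.finditer(text)]

print(strict_hits)  # includes the link with the lone bracket
print(naive_hits)   # misses it
```
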

Rob Kennedy