I am trying to parse wikitext retrieved through Wikipedia's API. The problem is that some of its templates (i.e. snippets enclosed in {{ and }}) are not automatically expanded into wikitext, so I have to find them in the article source myself and then replace them. Can I use a regex in .NET to get these matches from the text?

To try to make myself more clear, here is an example to illustrate what I mean:

For the string

{{ abc {{...}} def {{.....}} gh }}

there should be a single match, namely the entire string, so the longest possible match.

On the other hand, for "orphaned" braces such as in this example:

{{ abc {{...}}

the result should be a single match: {{...}}

Could anyone offer a suggestion? Thanks in advance.

+1  A: 

Don't do it with regex. Go through the string left to right. When you encounter a {{, push its position onto a stack; when you encounter a }}, pop the position of the matching {{ from the stack and calculate the length of the span between them. Then you can easily take the maximum of these lengths.
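The stack approach described above could be sketched roughly as follows. This is only an illustration in Python rather than .NET, and the function name `outermost_templates` is made up for the example. It assumes that braces only ever appear as {{ and }} pairs (wikitext's triple-brace parameters are not handled), and instead of a single maximum length it keeps every balanced span that is not enclosed by another one, which is what the question's two examples ask for.

```python
def outermost_templates(text):
    """Return the outermost balanced {{...}} spans in text, left to right."""
    stack = []  # start indexes of "{{" that have not been closed yet
    spans = []  # (start, end) of balanced spans not enclosed by another span
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            stack.append(i)
            i += 2
        elif pair == "}}":
            if stack:  # a "}}" with no open "{{" is simply skipped
                start = stack.pop()
                # Spans recorded after `start` are nested inside this one.
                while spans and spans[-1][0] > start:
                    spans.pop()
                spans.append((start, i + 2))
            i += 2
        else:
            i += 1
    # Any "{{" still on the stack is an orphan and produces no span.
    return [text[s:e] for s, e in spans]


print(outermost_templates("{{ abc {{...}} def {{.....}} gh }}"))
print(outermost_templates("{{ abc {{...}}"))
```

For the first string the whole input comes back as one span; for the orphaned example only the inner {{...}} is returned, because the first {{ never finds its closing pair.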

CodeInChaos
You're right, I tried using a stack and it is indeed a more suitable approach in this case. I am not very comfortable with regular expressions yet, but I suspect that the regex solutions would not always work as expected if there were unpaired braces in the string.
Gabriel S.
+2  A: 

You can do this with .NET regex using a balancing group definition.

The example given in the documentation shows how it works with nestable < and >. You can easily adapt the delimiters to {{ and }}. You can adapt it further to allow for single { and } within the "text" if you want.

Remember that { and } are regex metacharacters; to match them literally, escape them as \{ and \}.

polygenelubricants
A: 

This regex matches the pattern from your first example, with any number of templates nested one level deep inside the outer pair:

\{\{(?:[^{]+\{\{[^}]+\}\})+[^}]+\}\}

For the second request, you'll need a different regex:

\{\{.*?\}\}
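
A quick way to check the two patterns above against the examples from the question is sketched below. It uses Python's `re` module on the assumption that these particular constructs behave the same as in .NET. The third pattern, `flat`, is not part of the answer; it is added here only for comparison.

```python
import re

nested = re.compile(r"\{\{(?:[^{]+\{\{[^}]+\}\})+[^}]+\}\}")
lazy = re.compile(r"\{\{.*?\}\}")
# Assumption, not from the answer: forbid braces between the delimiters.
flat = re.compile(r"\{\{[^{}]*\}\}")

full = "{{ abc {{...}} def {{.....}} gh }}"
orphan = "{{ abc {{...}}"

nested_hits = nested.findall(full)   # first example, first pattern
lazy_hits = lazy.findall(orphan)     # second example, second pattern
flat_hits = flat.findall(orphan)     # second example, comparison pattern

print(nested_hits)
print(lazy_hits)
print(flat_hits)
```

The first pattern returns the whole first string as a single match. The lazy pattern, however, starts matching at the leftmost {{, so on the orphaned example it returns the entire string "{{ abc {{...}}" rather than only {{...}}; the brace-excluding variant returns just {{...}}.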
Vantomex