My regex skills are pretty poor, and most of the time they make me feel stupid. Can anyone help?

This question is more concerned with better mastery of regex than the job of extracting information from mud soup, so if my understanding of the MediaWiki template system is flawed, I don't really mind that much. I'll spot it soon enough.

I'm parsing MediaWiki markup, and I'm trying to grab MediaWiki template names. These are denoted by something like:

{{Template Name|other stuff

or

{{Template Name}}

If a # immediately follows the braces:

{{#Other thing

I'd like to ignore it.

So...

I'd like to match 2 curly braces {{ not followed by #, up until the next occurrence of either | (pipe) or }} (2 closing curlies).

So:

{{I am a frog|some other stuff        (match)

{{#I am a frog|some other stuff       (fail)

garbage here{{Monkey}}bla bla         (match)

garbage here{{#Monkey}}bla bla        (fail)

etc...

The following regex covers this (I think):

\{{2}(?!\#)(.*?)(?:\||\}\})

but also matches:

some stuff here {{{Giraffe|oijq

How can I make it fail if there are not exactly 2 opening curly braces?

EDIT: This is for the .NET regex engine, by the way.

A: 
(?<!\{)\{{2}(?!\#)(.*?)(?:\||\}\})

The zero-width negative look-behind

(?<!\{)

only matches a position that is not directly after a curly brace.
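A quick check helps show what the lookbehind alone does to the Giraffe example. This is only a sketch, and it assumes Python's `re` module rather than .NET; the lookaround constructs used here behave the same way in both engines for this input. The match still starts at the first brace, where nothing precedes the braces, and the lazy group then swallows the third one.

```python
import re

# Pattern from the question, and the same pattern with the lookbehind added.
original = re.compile(r"\{{2}(?!\#)(.*?)(?:\||\}\})")
with_lookbehind = re.compile(r"(?<!\{)\{{2}(?!\#)(.*?)(?:\||\}\})")

text = "some stuff here {{{Giraffe|oijq"

for name, pattern in [("original", original), ("with lookbehind", with_lookbehind)]:
    m = pattern.search(text)
    # Both patterns anchor at the first '{' (preceded by a space), so the
    # third '{' ends up inside the captured group.
    print(name, "->", m.group(1) if m else None)
```

Both lines print `{Giraffe`, so the lookbehind on its own does not reject the triple-brace case.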

Amber
Funny... that's what I came up with, but it doesn't work.
spender
+2  A: 

You probably want to use a zero-width negative lookbehind/lookahead assertion:

Lookbehind has the same effect as lookahead, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind. It will not match "cab", but will match the b (and only the b) in "bed" or "debt". (?<=a)b (positive lookbehind) matches the b (and only the b) in "cab", but does not match "bed" or "debt".
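The cab/bed/debt behaviour described above can be tried out directly. The sketch below assumes Python's `re` module instead of .NET, and `match_positions` is a hypothetical helper introduced only for this illustration.

```python
import re

negative = re.compile(r"(?<!a)b")  # a "b" NOT preceded by "a"
positive = re.compile(r"(?<=a)b")  # a "b" preceded by "a"

def match_positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

for word in ["cab", "bed", "debt"]:
    print(word,
          "negative:", match_positions(negative, word),
          "positive:", match_positions(positive, word))
```

In each case the match covers only the single character b; the lookbehind checks the preceding character without consuming it.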

So:

(?<!\{)\{{2}?(?!\#)(.*?)(?:\||\}\})

The other issue I just noticed: the (.*?) would still match the third curly brace. Instead, try adding the extra curly brace to the negative lookahead you are already using for #:

(?<!\{)\{{2}(?!\{*\#|\{+)(.*?)(?:\||\}\})
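To see how this last pattern treats the examples from the question, here is a small check. It is a sketch that assumes Python's `re` module in place of .NET, and `template_names` is a hypothetical helper name, not part of any library.

```python
import re

# Final pattern from this answer: exactly two opening braces,
# and no '#' (or further '{') right after them.
TEMPLATE_NAME = re.compile(r"(?<!\{)\{{2}(?!\{*\#|\{+)(.*?)(?:\||\}\})")

def template_names(text):
    """Return every captured template name found in text."""
    return [m.group(1) for m in TEMPLATE_NAME.finditer(text)]

samples = [
    "{{I am a frog|some other stuff",
    "{{#I am a frog|some other stuff",
    "garbage here{{Monkey}}bla bla",
    "garbage here{{#Monkey}}bla bla",
    "some stuff here {{{Giraffe|oijq",
]

for s in samples:
    print(repr(s), "->", template_names(s))
```

The lookbehind blocks a match starting at the second or third brace of `{{{`, and the `\{+` branch of the lookahead blocks the match starting at the first one, so the Giraffe line yields no names.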
gnarf
Same as my comment to Dav. That doesn't seem to do it.
spender
Updated the answer. I'm not sure whether you'll need to escape the # or { in a set; I don't think you do.
gnarf
OK. Giving you the answer, as you were correct about (.*?) matching the third brace, which led me to the solution. I ended up with the following: (?<!\{)\{{2}(?!\{*\#|\{+)(.*?)(?:\||\}\})
spender
Cool - Editing the Answer to include that as the last example
gnarf
A: 

A maybe hackish way would be to do an OR NOT with the same regex pattern repeated, except that the repeated copy matches 3 or more curly braces. Probably not the most elegant solution though. Good luck.
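One possible reading of this idea, sketched in Python's `re` rather than .NET (an assumption), is an alternation whose first branch consumes any run of 3 or more opening braces, so that the second branch never gets a chance to start inside it. Only the second branch has a capture group, so matches from the first branch can be discarded. `template_names_alt` is a hypothetical helper name.

```python
import re

# Branch 1: swallow a run of 3+ opening braces (no capture group).
# Branch 2: the pattern from the question, which captures the name.
EITHER = re.compile(r"\{{3,}|\{{2}(?!\#)(.*?)(?:\||\}\})")

def template_names_alt(text):
    """Keep only matches where branch 2 (the capturing branch) took part."""
    return [m.group(1) for m in EITHER.finditer(text) if m.group(1) is not None]

print(template_names_alt("garbage here{{Monkey}}bla bla"))
print(template_names_alt("some stuff here {{{Giraffe|oijq"))
```

This avoids lookbehind entirely, at the cost of filtering the matches afterwards in code.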

AaronLS
Care to elaborate? As I said, I'm stupid tonight!
spender