What regular expression pattern do I need to match everything between {{ and }}?

I'm trying to parse Wikipedia, but I'm ending up with orphan }} after running the regex code. Here's my PHP script.

<?php

$articleName='england';

$url = "http://en.wikipedia.org/wiki/Special:Export/" . $articleName;
ini_set('user_agent','custom agent'); //required so that Wikipedia allows our request.

$feed = file_get_contents($url);
$xml = new SimpleXmlElement($feed);

$wikicode = $xml->page->revision->text;



$wikicode=str_replace("[[", "", $wikicode);
$wikicode=str_replace("]]", "", $wikicode);
$wikicode=preg_replace('/\{\{([^}]*(?:\}[^}]+)*)\}\}/','',$wikicode);

print($wikicode);

?>

I think the problem is that I have nested {{ and }}, e.g.

{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}

+4  A: 

You can use:

\{\{(.*?)\}\}

Most regex flavors treat the brace { as a literal character, unless it is part of a repetition operator like {x,y} which is not the case here. So you do not need to escape it with a backslash, though doing it will give the same result.

So you can also use:

{{(.*?)}}

Sample:

$ echo {{StackOverflow}} | perl -pe 's/{{(.*?)}}/$1/'
StackOverflow

Also note that .* (which matches any character other than newline) is used here in its non-greedy form, .*?, so it will try to match as little as possible.

Example:

In the string '{{stack}}{{overflow}}' it will match 'stack' and not 'stack}}{{overflow'.
If you want the latter behavior, you can change .*? to .*, making the match greedy.
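
To see the difference between the two quantifiers in PHP, here is a minimal sketch. The variable names ($text, $lazy, $greedy) are just illustrative, and the input is the example string from above.

```php
<?php
// Sample input taken from the example above.
$text = '{{stack}}{{overflow}}';

// Lazy: .*? stops at the first }} it can reach, so each pair matches separately.
preg_match_all('/\{\{(.*?)\}\}/', $text, $lazy);

// Greedy: .* runs as far as it can and backs off to the last }} on the line.
preg_match_all('/\{\{(.*)\}\}/', $text, $greedy);

print_r($lazy[1]);   // two captures: stack, overflow
print_r($greedy[1]); // one capture: stack}}{{overflow
```

Neither form knows anything about nesting; they only differ in which closing }} they settle on.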

codaddict
Probably should point out that this is a non-greedy match, that is, it will match up to the first occurrence of }} and not the last, which may or may not be what the OP requires.
Richard
A: 

\{{2}(.*)\}{2} or, cleaner, with lookarounds (?<=\{{2}).*(?=\}{2}), but only if your regex engine supports them.

If you want your match to stop at the first found }} (i.e. non-greedy) you should replace .* with .*?.

Also, you should take into account your engine's single-line matching setting, as in some engines . will not match newline characters by default. You can either enable single-line mode or use [\s\S]* instead of .*.
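
A rough PHP sketch of the points above, assuming a made-up two-line input. In PHP, single-line mode is the s pattern modifier (PCRE_DOTALL); the variable names are only for illustration.

```php
<?php
// Hypothetical input: a template that spans two lines.
$text = "{{first line\nsecond line}}";

// Without the s modifier, . does not match "\n", so nothing is found.
$plain = preg_match('/\{\{(.*?)\}\}/', $text, $m1);

// With the s (PCRE_DOTALL) modifier, . also matches newlines.
$dotall = preg_match('/\{\{(.*?)\}\}/s', $text, $m2);

// Lookaround variant: the braces are asserted but are not part of the match.
$look = preg_match('/(?<=\{{2}).*?(?=\}{2})/s', $text, $m3);

var_dump($plain);  // 0, no match
var_dump($m2[1]);  // the text between the braces, in capture group 1
var_dump($m3[0]);  // the same text, but as the whole match
```

With the lookaround version the whole match is already the inner text, so no capture group is needed.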

Alin Purcaru
Do you think `\{{2}` is more readable than `\{\{`?
Gumbo
Try it on `{{foo}} bar {{baz}}`.
KennyTM
@Gumbo No I do not think it is more readable but I tend to avoid redundancy. Maybe he'll want to match 3 {s next time. @KennyTM I posted the answer mainly to suggest lookarounds because the asker wanted just `...`, and not `{{...}}`. I updated it now with more info he might need.
Alin Purcaru
@Stuart. You should update your question, not post comments. Just copy them from here and then delete them.
Alin Purcaru
It's *single-line* or (preferably) **DOTALL** mode that allows `.` to match newlines; *multiline* mode alters the behavior of the anchors, `^` and `$`.
Alan Moore
A: 

Besides using the already mentioned non-greedy quantifier, you can also use this:

\{\{(([^}]|}[^}])*)}}

The inner ([^}]|}[^}])* is used to only match sequences of zero or more arbitrary characters that do not contain the sequence }}.
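
As a quick check in PHP, here is a small sketch with a made-up input that contains a single } and a newline inside one of the pairs. The input and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical input: the second pair contains a lone } and a line break.
$text = "{{one}} and {{t}w\no}}";

// [^}] takes any character except }, and }[^}] allows a single }
// as long as it is not followed by another }.
preg_match_all('/\{\{(([^}]|}[^}])*)}}/', $text, $m);

print_r($m[1]); // captures: one, and t}w<newline>o
```

Because a negated character class also matches newlines, this pattern works across lines without needing the s modifier.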

Gumbo
A: 

A greedy version to get the shortest match is

\{\{([^}]*(?:\}[^}]+)*)\}\}

(For comparison, with the string {{fd}sdfd}sf}x{dsf}}, the lazy version \{\{(.*?)\}\} takes 57 steps to match, while my version only takes 17 steps. This assumes the debug output of RegexBuddy can be trusted.)

KennyTM
+2  A: 

Your edit shows that you're trying to do a recursive match, which is very different from the original question. If you weren't just deleting the matched text I would advise you not to use regexes at all, but this should do what you want:

$wikicode=preg_replace('~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s',
                       '', $wikicode);

After the first {{ matches an opening delimiter, (?:(?!{{|}}).)++ gobbles up everything until the next delimiter. If it's another opening delimiter, the (?R) takes over and applies the whole regex again, recursively.

(?R) is about as non-standard as regex features get. It comes from the PCRE library, which is what powers PHP's regex flavor (Perl 5.10 and later support it as well). Some other flavors have their own ways of matching recursive structures, all of them very different from each other.
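
Here is a small self-contained sketch of the recursive pattern applied to the nested sample from the question. The words "before" and "after" around it are made up, just to show what survives.

```php
<?php
// Nested sample from the question, with hypothetical text on either side.
$wikicode = 'before {{ something {{ something else {{ something new }}'
          . '{{ something old }} something blue }} something green }} after';

// Same pattern as above: consume non-delimiter characters possessively,
// or recurse into the whole pattern with (?R) when a nested {{ starts.
$pattern = '~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s';

$stripped = preg_replace($pattern, '', $wikicode);

var_dump($stripped); // only the text outside the outermost {{ }} remains
```

The whole outer template, including everything nested inside it, is removed as one match, so no orphan }} is left behind.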

Alan Moore
you're a genius! thank you!
Stuart