In a project I have text containing patterns like this:

{| text {| text |} text |}
more text

I want to extract the first bracketed part, including any nested blocks. For this I use preg_match with a recursive pattern. The following code already works for plain { } delimiters:

preg_match('/\{((?>[^\{\}]+)|(?R))*\}/x',$text,$matches);

But if I add the symbol "|" to the delimiters, I get an empty result and I don't know why:

preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x',$text,$matches);

I can't use the first solution because the text can also contain something like { text }. Can somebody tell me what I am doing wrong here? Thanks.

A: 

See http://stackoverflow.com/questions/1896647/php-help-with-my-regex-based-recursive-function

To adapt it to your use case:

preg_match_all('/\{\|(?:^(\{\||\|\})|(?R))*\|\}/', $text, $matches);
Tor Valamo
A: 

Thanks for your fast answer. I already had a look at this page and many others.

I also tried something like what you posted, and I tested your solution, but I still get an empty result. Here is an example of the text I have to extract from:

{| border=1 align=right cellpadding=4 cellspacing=0 width=250 style="margin: 0 0 1em 1em; background: #f9f9f9; border: 1px #aaaaaa solid; border-collapse: collapse; font-size: 95%;"
|+<big><big>'''남극'''</big></big>
| align=center colspan=2 style="background:#f9f9f9;" |

[[파일:LocationAntarctica.png|250px|남극의 위치.]]
|-
| '''[[면적]]''' || 14,000,000&nbsp;km²
|-
| '''[[인구]]''' || ~1000 (비상주인구)
|-
| '''[[정부]]''' <br />
|| [[남극 조약]]<br />
- 현재 사무국장 <br />[[요하네스 후버]]
|-
| '''[[남극 조약#영유권 주장 회원국|영토<br />주장국]]''' || {{ARG}} <br /> {{AUS}} <br /> {{CHL}} <br /> {{FRA}} <br /> {{NZL}} <br /> {{NOR}} <br /> {{국기나라|영국}}  
|-
| '''인터넷 도메인''' || [[.aq]]
|-
| '''국제 전화''' || +672
|}

[[파일:Antarctica_6400px_from_Blue_Marble.jpg|thumb|250px|남극의 인공위성 합성사진.]]
[[파일:AntarcticaDomeCSnow.jpg|thumb|200px|남극의 모습]]

'''남극'''(南極)은 [[지구]]의 최남단에 있는 대륙으로, 한가운데 [[남극점]]이 있다. 남극 대륙은 거의 대부분 [[남극권]] 이남에 자리잡고 있으며, 주변에는 [[남극해]]가 있다. 면적은 약 1,440만 km²로서 [[아시아]], [[아프리카]], [[북아메리카]], [[남아메리카]]에 이어 다섯 번째로 큰 대륙이다. 남극의 약 98%가 얼음으로 덮여 있는데, (얼음으로 덮이지 않은 면적은 약 280,000 ㎢에 불과함) 이 얼음은 평균 두께가 1.6km에 이른다. 
...
Prog
You should edit your question to include this information, and delete this answer.
Roger Pate
+3  A: 

Try this:

'/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/'

In your original regex you use the character class [^{}] to match anything except a delimiter. That's fine when the delimiters are only one character long, but yours are two characters. To avoid matching a multi-character sequence you need something like this:

(?:(?!\{\||\|\}).)++

The dot matches any character (including newlines, thanks to the (?s)), but only after the lookahead has determined that it's not the start of a {| or |} sequence. I also dropped your atomic group ((?>...)) and replaced it with a possessive quantifier (++) to reduce clutter. But you should definitely use one or the other in that part of the regex to prevent catastrophic backtracking.
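
As a quick check, here is a minimal sketch that runs the pattern against the short sample from the question. The variable names $text and $pattern are only illustrative, and the input is assumed to be the two-line snippet shown in the question rather than the full wiki page.

```php
<?php
// Sample input taken from the question; $text is a hypothetical variable name.
$text = "{| text {| text |} text |}\nmore text";

// (?s) lets the dot match newlines. The negative lookahead keeps the dot from
// consuming the first character of a {| or |} delimiter, and ++ is possessive,
// so the engine never backtracks into the run of ordinary characters.
$pattern = '/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/';

if (preg_match($pattern, $text, $matches) === 1) {
    echo $matches[0], "\n"; // prints: {| text {| text |} text |}
} else {
    echo "no match\n";
}
```

Because the lookahead only rejects the two-character sequences, a lone { text } inside a block should be consumed as ordinary text.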

Alan Moore
I just tried your solution and it works well. Thank you very much! And also thanks for the explanation, because it's not easy to understand.
Prog
+1  A: 

You've got a few suggestions for working regular expressions, but if you're wondering why your original regexp failed, read on. The problem arises when it comes time to match a closing "|}" delimiter. The (?>[^{}]+) (or [^{}]++) subexpression also matches the "|", so the following \|\} subexpression finds only "}" and fails. Because the atomic group forbids backtracking into the subexpression, the engine cannot give the "|" back, and there's no way to recover from the failed match.
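
The effect can be reproduced in isolation. The sketch below is only an illustration: $tail is a hypothetical fragment standing for the end of an inner block, and the two patterns are cut-down versions of the relevant part of the original regex, not the full recursive pattern.

```php
<?php
// Hypothetical fragment: the end of an inner block, just before its closing |}.
$tail = ' text |}';

// Atomic group: [^{}]+ consumes " text |" and is not allowed to give the "|" back,
// so \|\} is tested against "}" alone and the match fails.
$atomic = preg_match('/(?>[^{}]+)\|\}/', $tail);

// Plain greedy quantifier: the engine backtracks one character, freeing the "|".
$greedy = preg_match('/[^{}]+\|\}/', $tail);

var_dump($atomic, $greedy); // int(0) and int(1)
```

Dropping the atomic group makes this fragment match, but it reopens the door to heavy backtracking on longer inputs, which is why excluding the delimiters with a lookahead is the safer fix.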

outis