I'm using a RegEx on an XML dump of a Wikipedia article.

The regex is: {{[a-zA-Z0-9_\(\)\|\?\s\-\,\/\=\[\]\:.]+}}

I want to match all the text wrapped in {{ and }}. But instead of the 56 matches I get from a simple search for {{, it only finds 45.

A sample block it doesn't detect is: {{cite journal | last = Heeks | first = Richard | year = 2008 | title = Meet Marty Cooper - the inventor of the mobile phone | journal = BBC | volume = 41 | issue = 6 | url = http://news.bbc.co.uk/2/hi/programmes/click_online/8639590.stm | pages = 26–33 | doi = 10.1109/MC.2008.192 }}

But it does detect: {{cite web | title = Of Cigarettes and Cellphones | last = Ulyseas | first = Mark | date = 2008-01-18 | url = http://www.thebalitimes.com/2008/01/18/of-cigarettes-and-cellphones/ | publisher = The Bali Times | accessdate = 2008-02-24 }}

Can anyone please spot the problem for me?

+2  A: 

Some of the escaping is superfluous, but I don't think that's the real problem.

I recommend trying \w instead of a-zA-Z0-9_, especially because in .NET regex \w also matches Unicode letters (unless ECMAScript-compliant mode is enabled).

Alternatively, if the text part cannot contain } (which right now it can't anyway), you can simply use {{[^}]+}}.

The [^...] is a negated character class. [^}] matches anything but }.
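
A minimal Python sketch of the difference, run against shortened versions of the two sample blocks from the question. Python's `re` module stands in for the .NET engine here, and the names `CLASS`, `WHITELIST`, `NEGATED`, and `stray_chars` are made up for illustration. One observation about the text as quoted above: the page range 26–33 in the undetected block is written with an en dash (U+2013) rather than the ASCII hyphen listed in the original class, which on its own would be enough to stop the whitelist pattern from matching that block.

```python
import re

# The character class from the question, with the redundant escapes dropped.
CLASS = r"[a-zA-Z0-9_()|?\s\-,/=\[\]:.]"
WHITELIST = re.compile(r"\{\{" + CLASS + r"+\}\}")
# The suggested alternative: anything except a closing brace.
NEGATED = re.compile(r"\{\{[^}]+\}\}")

# Shortened samples; \u2013 is the en dash found in "pages = 26–33".
journal = "{{cite journal | last = Heeks | pages = 26\u201333 | doi = 10.1109/MC.2008.192 }}"
web = "{{cite web | last = Ulyseas | date = 2008-01-18 | accessdate = 2008-02-24 }}"


def stray_chars(block):
    """Return the characters inside the delimiters that CLASS does not accept."""
    inner = block[2:-2]
    return sorted({ch for ch in inner if not re.fullmatch(CLASS, ch)})


for name, block in (("journal", journal), ("web", web)):
    print(name,
          bool(WHITELIST.fullmatch(block)),
          bool(NEGATED.fullmatch(block)),
          stray_chars(block))
```

Under these assumptions the whitelist pattern accepts the web block and rejects the journal block, while the negated class accepts both; `stray_chars` reports the en dash as the only character the whitelist does not cover.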

polygenelubricants
Thanks for the suggestion, but the problem still persists. :(
Bibhas
@Bibhas: recreate a short snippet to demonstrate this. Hardcode the string etc. Maybe link to ideone.com snippet.
polygenelubricants
Hey, thank you very much. `{{[^}]+}}` worked like a charm and detected all 56 instances without any other escapes or matches. Awesome. Many thanks. :) It was silly of me not to think of it. :-\
Bibhas
@Bibhas: depending on whether `}` (perhaps escaped) can appear in the text, `[^}]+` may not work for _all_ cases.
polygenelubricants
I'm aware of that and will keep an eye on the contents. It usually isn't like that in those articles, but it is a possible case. :)
Bibhas
And I can always use `^(}})`, since `}}` will never appear inside the block I need. :)
Bibhas
@Bibhas: Unfortunately no, you can't do that. A character class by itself can only match one char at a time, not a sequence like `}}`. Further discussion: http://stackoverflow.com/questions/3148240/regex-why-doesnt-01-12-range-work-as-expected/ ; If the delimiter is fixed to `{{` and `}}`, then perhaps just using good old-fashioned string search is simpler than regex.
polygenelubricants
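
To illustrate the comment above: a rough Python sketch showing that a class such as `[^(}})]` only ever tests one character, followed by the plain string search alternative. The function name `find_templates` and the sample text are hypothetical, and the search deliberately ignores nested `{{...}}` blocks, which do occur in real wikitext.

```python
import re

# Inside [...] the parentheses and the repeated brace are ordinary single
# characters, so this means "any one character except (, ) or }".
not_a_sequence = re.compile(r"[^(}})]")


def find_templates(text):
    """Collect every {{...}} span using plain string search (no nesting support)."""
    spans = []
    pos = 0
    while True:
        start = text.find("{{", pos)
        if start == -1:
            break
        end = text.find("}}", start + 2)
        if end == -1:
            break
        spans.append(text[start:end + 2])
        pos = end + 2
    return spans


sample = "intro {{cite web | title = a } b (c) }} middle {{main|Phone}} end"
print(find_templates(sample))
```

Because the search looks for the two-character delimiter `}}` directly, a lone `}` inside a block does not end it, which is the case that `[^}]+` cannot handle.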
+1  A: 

Your character class is...special. For starters, everything you're matching is covered by the . at the end. Also, curly braces ({}) are special characters, so they should be escaped. Finally, you'll want to force it not to be greedy by adding a ? after that +, otherwise it will match curly braces.

EDIT: I won't try to go back on what I said, but I would like to note that I was mistaken about pretty much everything in this post (other than that braces should be escaped, which is just a matter of good practice).

macamatic
Curly braces aren't included in the `[...]`, so greedy vs reluctant is a non-issue.
polygenelubricants
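
A small Python check of that point, using a made-up input string: when the class excludes `}`, the greedy and reluctant quantifiers cannot differ, because the repetition has to stop at the first closing brace either way.

```python
import re

text = "{{a|b}} and {{c}}"

# Greedy and reluctant versions of the same negated-class pattern.
greedy = re.findall(r"\{\{[^}]+\}\}", text)
lazy = re.findall(r"\{\{[^}]+?\}\}", text)

print(greedy == lazy, greedy)
```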
Actually, the . at the end only matches the literal ., because it is in a character class.
Billy ONeal
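
A quick Python illustration of that rule (the variable names are only for this sketch): inside a character class the dot loses its wildcard meaning.

```python
import re

dot_in_class = re.compile(r"[.]")  # matches a literal full stop only
bare_dot = re.compile(r".")        # matches any character except a newline

for ch in (".", "a", "}"):
    print(repr(ch), bool(dot_in_class.fullmatch(ch)), bool(bare_dot.fullmatch(ch)))
```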
My mistake, I totally forgot that rule...the curly braces should still be escaped, although that obviously isn't the problem (I'm guessing the regex parser is smart enough to treat the { as a literal since there was no character class prior).
macamatic
@macamatic (+1): Some flavors don't require curly braces to be escaped depending on context, and .NET is one of those (otherwise the pattern would simply throw an exception instead of missing some matches). I think there is an argument for taking advantage of flavor-specific quirks/features if it makes your life easier, which is why regex questions should always specify the flavor whenever possible. You are right, though, that in some flavors the braces will need to be escaped.
polygenelubricants
I suppose, but if you ask me, the benefit of not having backslashes before your braces is far outweighed by the advantages of always escaping: you will never mistakenly leave out a backslash where it's necessary, whether that's because it's required in context or because of a difference in regex implementation. On the other hand, the same principles are used to justify Allman style formatting, which I despise, so to each his own.
macamatic
I always escape braces (and square brackets inside character classes), whether I need them or not. It gives me one less cross-flavor idiosyncrasy to keep track of. Or is that two? ;)
Alan Moore
A: 

The regex {{(.*?)}} works well for me in Perl. It captures everything between the doubled braces.

Sir Wobin
`.*?` should be discouraged if better alternatives exist (and they probably do in this case).
polygenelubricants
This also works. :) Thanks. :D
Bibhas
@polygenelubricants, could you please explain why it should be discouraged? For my own reference.
Bibhas
@Bibhas: generally with regex you want to be as specific as practically possible. `.*?` may be necessary in some cases, but this "take anything, but not too much" construct is usually just a poor substitute for "take everything except these things". A negated character class alternative usually exists for these scenarios, and would be much better. Further discussion with generic examples: http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532
polygenelubricants
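
A hedged Python sketch of how the two patterns from this thread differ in practice. The two input strings are invented for illustration, and Python's default behaviour (the dot does not match a newline unless `re.DOTALL` is set) is assumed.

```python
import re

LAZY_DOT = re.compile(r"\{\{(.*?)\}\}")
NEGATED = re.compile(r"\{\{([^}]+)\}\}")

# A block spread over several lines, and a block containing a lone "}".
multiline = "{{cite web\n| title = Example\n}}"
single_brace = "{{note | text = a } b }}"

for label, text in (("multiline", multiline), ("single brace", single_brace)):
    print(label, bool(LAZY_DOT.search(text)), bool(NEGATED.search(text)))
```

With these inputs the lazy dot misses the multi-line block, since the dot stops at a newline by default, while the negated class misses the block containing a lone `}`. Neither is universally right; the negated class simply states its one exclusion explicitly.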