ansaurus

Question

Hidden Features of RegEx

Answer 1

+2 A:

I won't try; I can't write anything more useful than this: Crucial Concepts Behind Advanced Regular Expressions.

Moayad Mardini 2009-05-15 11:46:38

Answer 2

+1 A:

Backreference construct. For example, to find doubled word characters, use:

(?<char>\w)\k<char>

Here \k<char> matches whatever text the group named "char" (the \w) has just captured, not any word character in general.

A backreference lets you search for further instances of the text that a capturing group (named, as here, or numbered) has already matched. Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.

For example, the regular expression (?<char>\w)\k<char>, using named groups and backreferencing, searches for adjacent paired characters. When applied to the string "I'll have a small coffee", it finds matches in the words "I'll", "small", and "coffee". The metacharacter \w matches any single word character. The grouping construct (?<char>) encloses the metacharacter to force the regular expression engine to remember a subexpression match (which, in this case, will be any single word character) and save it under the name "char". The backreference construct \k<char> causes the engine to compare the current character to the previously matched character stored under "char". The entire regular expression successfully finds a match wherever a single word character is the same as the preceding character.
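To make the walkthrough above concrete, here is a minimal sketch in Python. One assumption to flag: Python's re module does not accept the .NET-style (?<char>...)\k<char> spelling used in this answer, so the sketch uses Python's own equivalents, (?P<char>...) and (?P=char). The helper name doubled_chars is made up for illustration.

```python
import re

# Python spells the named group (?P<char>...) and the named
# backreference (?P=char); the meaning is the same as (?<char>\w)\k<char>.
DOUBLED = re.compile(r"(?P<char>\w)(?P=char)")

# Equivalent pattern using an old-fashioned numbered group.
DOUBLED_NUMBERED = re.compile(r"(\w)\1")


def doubled_chars(text, pattern=DOUBLED):
    """Return every adjacent pair of identical word characters in text."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sentence = "I'll have a small coffee"
    print(doubled_chars(sentence))                    # ['ll', 'll', 'ff', 'ee']
    print(doubled_chars(sentence, DOUBLED_NUMBERED))  # same result
```

The four pairs come from "I'll", "small", and "coffee" (which contributes both "ff" and "ee").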

Added:

One example of using a backreference is checking whether a particular tag is well-formed XML. The start tag and the end tag are required to match, so a backreference is handy. Check out my example of an expression to match an XML tag.
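The linked example is not reproduced here, so the following is only a rough sketch of the idea, written in Python syntax. The pattern and the helper name match_element are assumptions made for illustration. It deliberately ignores self-closing tags, comments, and nested elements with the same name, so it is not a substitute for a real XML parser.

```python
import re

# The closing tag must repeat whatever name the opening tag captured.
TAG_PAIR = re.compile(
    r"<(?P<tag>[A-Za-z_][\w.-]*)(?:\s[^<>]*)?>"  # opening tag, optional attributes
    r"(?P<body>.*?)"                              # element content, non-greedy
    r"</(?P=tag)\s*>",                            # closing tag: backreference to the name
    re.DOTALL,
)


def match_element(text):
    """Return (tag, body) if text is one element with matching tags, else None."""
    m = TAG_PAIR.fullmatch(text)
    return (m.group("tag"), m.group("body")) if m else None


if __name__ == "__main__":
    print(match_element("<b>bold</b>"))           # ('b', 'bold')
    print(match_element('<a href="x">link</a>'))  # ('a', 'link')
    print(match_element("<b>bold</i>"))           # None: end tag does not match
```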

Rashmi Pandit 2009-05-15 11:49:50

You might want to mention that you can do this with old-fashioned numbered capturing groups too, i.e.: (\w)\1

Alan Moore 2009-05-16 06:37:06

Answer 3

+2 A:

Atomic grouping. (And, if your engine has them, possessive quantifiers, which are a notational convenience for the most common use case of atomic grouping.)

This is very often the answer to making your regular expressions fast. Anyone who's struggled with regular expressions knows how easy it is to accidentally write a regular expression that takes forever on long input. As a simple example, this regex:

([+\w-]+)@foo\.com

takes time proportional to n² to run against a string of n "x" characters (with most engines; Perl does some fancy optimization). It's easy to construct a regular expression that takes O(n³), O(n⁴), etc. time when failing to match, and even ones that take O(2ⁿ) time. Often, the way out of this is to signal to the regular expression engine that it shouldn't backtrack through certain constructs. That's where atomic grouping comes in.

This regular expression matches exactly the same things as the other one, but only takes O(n) time:

((?>[+\w-]+))@foo\.com

The atomic group construct (?> ) tells the regular expression engine not to backtrack into the parentheses if subsequent characters fail to match. Instead, the group's match is kept as it stands, and the attempt at that starting position fails. This is what keeps the behavior linear.
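As a rough illustration of the "no backtracking into the group" behavior, here is a sketch in Python. Python's re module accepts (?>...) only from version 3.11 onward, so this sketch swaps in a well-known workaround that also runs on older versions: a lookahead that captures, followed by a backreference, (?=(...))\1. A lookahead is not re-entered once it has succeeded, which gives the same effect. The names PLAIN, ATOMIC_LIKE, and find_user are made up for illustration, and the dot in foo\.com is escaped so it matches only a literal ".". The sketch makes no timing claims; whether the rewrite actually helps depends on the engine and the input, so measure it.

```python
import re

# Ordinary group: the engine may give back characters one at a time.
PLAIN = re.compile(r"([+\w-]+)@foo\.com")

# Emulated atomic group: the lookahead captures the longest run, \1 consumes
# exactly that run, and the engine cannot retry with a shorter one.
ATOMIC_LIKE = re.compile(r"(?=([+\w-]+))\1@foo\.com")


def find_user(pattern, text):
    """Return the part before @foo.com, or None if there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    text = "mail bob+lists@foo.com today"
    print(find_user(PLAIN, text))        # bob+lists
    print(find_user(ATOMIC_LIKE, text))  # bob+lists

    # Where backtracking matters, the two styles differ:
    print(bool(re.search(r"a+ab", "aaab")))           # True: a+ gives back one "a"
    print(bool(re.search(r"(?=(a+))\1ab", "aaab")))   # False: nothing is given back
```

The last two lines show the kind of case where you really do want backtracking: the atomic-style pattern refuses a string that the ordinary one accepts.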

You do have to be a little bit careful with atomic groups - sometimes you really do want backtracking - but it's a nice thing to start thinking of whenever you notice performance issues in your regular expressions.

(It's a shame Python doesn't support this feature - yet)

Daniel Martin 2009-05-15 14:36:20

Answer 4

+3 A:

Isn't everything in RegEx a "hidden" feature? How does the joke go?

I had a problem that I tried to solve with regular expressions. Now I have two problems.

:)

Bruce McGee 2009-08-13 02:02:01

199 hits and counting: http://www.google.com/search?q=regex+%22now+(you+OR+they+OR+he)+(have+OR+has)+two+problems%22+site:stackoverflow.com

Alan Moore 2009-08-13 07:53:23

I heard it a long time ago when I was the only Windows developer in a Unix shop. Those Unix guys are a laugh a minute.

Bruce McGee 2009-08-13 10:48:26
