ansaurus

Question

Matching Nested Structures With Regular Expressions in Python

Answer 1

+1 A:

I'd recommend removing the nesting from the regex itself, looping through the results and performing regexes on that.

FrozenFire 2009-07-08 16:34:13

Answer 2

+2 A:

Python doesn't support recursion in regular expressions. So equivalents to .NET's balancing groups or PCRE regex in Perl are not immediately possible in Python.

Like you said yourself: this is NOT a problem you should really be solving with a single regex.

Il-Bhima 2009-07-08 16:38:07

PCRE supports recursive patterns using the (?R) directive among other things. Python might have supported an older PCRE but not the more recent versions. http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions

Il-Bhima 2009-07-09 07:27:29

Answer 3

+1 A:

Are you talking about recursion? It's not clear from your question. An example:

ActivePython 2.6.1.1 (ActiveState Software Inc.) based on
Python 2.6.1 (r261:67515, Dec  5 2008, 13:58:38) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> p = re.compile(r"((\w+((\d+)[.;]))(\s+)\d)")
>>> m = p.match("Fred99. \t9")
>>> m
<_sre.SRE_Match object at 0x00454F80>
>>> m.groups()
('Fred99. \t9', 'Fred99.', '9.', '9', ' \t')

This shows matching of nested groups. The numbering of groups depends on the order in which their opening parenthesis occurs in the pattern.

Vinay Sajip 2009-07-08 16:57:31

Answer 4

+6 A:

Regular expressions cannot parse nested structures. Nested structures are not regular, by definition. They cannot be constructed by a regular grammar, and they cannot be parsed by a finite state automaton (a regular expression can be seen as a shorthand notation for an FSA).

Today's "regex" engines sometimes support some limited "nesting" constructs, but from a technical standpoint, they shouldn't be called "regular" anymore.

Svante 2009-07-08 23:10:34

+1 for this important piece of information. It should be noted that adding nesting support is NOT harmless. One of the cool things about true regex engines is that need no extra memory while processing, except a constant amount to store the state machine and a variable to remember the current state. Another is the running speed, which I think is linear to the length of the input. Adding nesting support messes up both of these benefits.

harms 2009-07-09 15:20:53

Answer 5

+5 A:

You can't do this generally using Python regular expressions. (.NET regular expressions have been extended with "balancing groups" which is what allows nested matches.)

However, PyParsing is a very nice package for this type of thing:

from pyparsing import nestedExpr

data = "( (a ( ( c ) b ) ) ( d ) e )"
print nestedExpr().parseString(data).asList()

The output is:

[[['a', [['c'], 'b']], ['d'], 'e']]

ansaurus

tags:

views:

answers:

Matching Nested Structures With Regular Expressions in Python

related questions