tags:

views:

586

answers:

6

Is there any way to beat the 100-group limit for regular expressions in Python? Also, could someone explain why there is a limit. Thanks.

+11  A: 

There is a limit because it would take too much memory to store the complete state machine efficiently. I'd say that if you have more than 100 groups in your re, something is wrong either in the re itself or in the way you are using them. Maybe you need to split the input and work on smaller chunks or something.

Keltia
I agree with your sentiment. If you're hitting the 100 group regex limit, I think there's something wrong with the design.
Kamil Kisiel
A: 

I would say you could reduce the number of groups by using non-grouping parentheses, but whatever it is that you're doing seems like you want all these groupings.

orip
+2  A: 

First, as others have said, there are probably good alternatives to using 100 groups. The re.findall method might be a useful place to start. If you really need more than 100 groups, the only workaround I see is to modify the core Python code.

In [python-install-dir]/lib/sre_compile.py simply modify the compile() function by removing the following lines:

# in lib/sre_compile.py
if pattern.groups > 100:
    raise AssertionError(
        "sorry, but this version only supports 100 named groups"
        )

For a slightly more flexible version, just define a constant at the top of the sre_compile module, and have the above line compare to that constant instead of 100.

Funnily enough, in the (Python 2.5) source there is a comment indicating that the 100 group limit is scheduled to be removed in future versions.

Triptych
+1  A: 

Can you explain why you need more than 100 groups? Perhaps we can help you find an alternate solution.

Suraj Barkale
+1  A: 

I'm not sure what you're doing exactly, but try using a single group, with a lot of OR clauses inside... so (this)|(that) becomes (this|that). You can do clever things with the results by passing a function that does something with the particular word that is matched:

 newContents, num = cregex.subn(lambda m: replacements[m.string[m.start():m.end()]], contents)

If you really need so many groups, you'll probably have to do it in stages... one pass for a dozen big groups, then another pass inside each of those groups for all the details you want.

Jim Carroll
I used the or method you mentioned and a few tricks of my own. Thanks Jim.
Evan Fosmark
A: 

in my case, i have a dictionary of n words and want to create a single regex that matches all of them.. ie: if my dictionary is

hello goodbye

my regex would be: (^|\s)hello($|\s)|(^|\s)goodbye($|\s) ... it's the only way to do it, and works fine on small dictionaries, but when you have more tan 50 words, well...

alex
Sounds like the same problem to me, with the same solution: `\b(hello|goodbye|whatever)\b` If that doesn't work, ask a new question so we can help you properly.
Alan Moore