tags:

views:

867

answers:

6

Is there a way to determine how many capture groups there are in a given regular expression?

I would like to be able to do the follwing:

def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default

    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    return ('',) * num_of_groups(regexp)

This allows me to do stuff like:

first, last, phone = groups(r'(\w+) (\w+) ([\d\-]+)', 'John Doe 555-3456')

However, I don't know how to implement num_of_groups. (Currently I just work around it.)

EDIT: Following the advice from rslite, I replaced re.findall with re.search.

sre_parse seems like the most robust and comprehensive solution, but requires tree traversal and appears to be a bit heavy.

MizardX's regular expression seems to cover all bases, so I'm going to go with that.

A: 

First of all if you only need the first result of re.findall it's better to just use re.search that returns a match or None.

For the groups number you could count the number of open parenthesis '(' except those that are escaped by '\'. You could use another regex for that:

def num_of_groups(regexp):
    rg = re.compile(r'(?<!\\)\(')
    return len(rg.findall(regexp))

Note that this doesn't work if the regex contains non-capturing groups and also if '(' is escaped by using it as '[(]'. So this is not very reliable. But depending on the regexes that you use it might help.

rslite
A: 

The lastindex property of the match object should be what you are looking for. See the re module docs.

agnul
If no match is found, I don't have a match object. Plus, I don't think that's what lastindex does.
itsadok
+2  A: 

Something from inside sre_parse might help.

At first glance, maybe something along the lines of:

>>> import sre_parse
>>> sre_parse.parse('(\d)\d(\d)')
[('subpattern', (1, [('in', [('category', 'category_digit')])])), 
('in', [('category', 'category_digit')]), 
('subpattern', (2, [('in', [('category', 'category_digit')])]))]

I.e. count the items of type 'subpattern':

import sre_parse

def count_patterns(regex):
    """
    >>> count_patterns('foo: \d')
    0
    >>> count_patterns('foo: (\d)')
    1
    >>> count_patterns('foo: (\d(\s))')
    1
    """
    parsed = sre_parse.parse(regex)
    return len([token for token in parsed if token[0] == 'subpattern'])

Note that we're only counting root level patterns here, so the last example only returns 1. To change this, tokens would need to searched recursively.

miracle2k
A: 

Might be wrong, but I don't think there is a way to find the number of groups that would have been returned had the regex matched. The only way I can think of to make this work the way you want it to is to pass the number of matches your particular regex expects as an argument.

To clarify though: When findall succeeds, you only want the first match to be returned, but when it fails you want a list of empty strings? Because the comment seems to show all matches being returned as a list.

Adam Bellaire
A: 

Using your code as a basis:

def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default

    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    return ('',) * len(m.groups())
Will Boyce
This will throw an exception when no match is found
itsadok
+4  A: 
def num_groups(regex):
    pattern = re.compile(r"(?<!\\)(?:\\\\)*(?:\[(?:\\.|[^\\\]])*\]|(\()(?!\?(?!P<)))")
    return len([ 1 for x in re.finditer(pattern, regex) if x.group(1) ])

It looks for unescaped '[' or '('. For '[' it looks for the next unescaped ']'. The '(' can't be followed by a '?', unless that is followed by 'P<'. (Named groups.) It then filters for the capturing groups, and counts them.

Also looking for character classes is necessary, because '(' can appear unescaped inside them. Using look-arounds to detect them is not possible, since look-behinds need to be of fixed length.


EDIT: (4 months later)

Too simple!

def num_groups(regex):
    return re.compile(regex).groups
MizardX
+1 for the edited version :-)
Carl Meyer