ansaurus

Question

Overlapping matches with finditer() in Python

Answer 1

+2 A:

Once a character has been consumed by a match, it stays consumed; you cannot ask the regex engine to go back and match it again.
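
A minimal sketch of what "consumed" means in practice, using a made-up digit string rather than the references from the question: finditer() resumes scanning where the previous match ended, so two matches never share a character.

```python
import re

# Hypothetical input, chosen only to illustrate the scanning behaviour.
text = "12345"

# Each match consumes two digits; the next search starts right after them,
# so '23' and '45' are never reported.
pairs = [m.group(0) for m in re.finditer(r"\d\d", text)]
print(pairs)  # ['12', '34']
```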

From your examples, the verse part (e.g. :1) does not seem to be optional. Removing the '?' that made it optional lets the last reference match.

import re

ref_regex = re.compile(r'''
(?<!\w)                         # Not preceded by a word character
(q(?:uote)?\s+)?                # Optional 'q' or 'quote' followed by whitespace
(
    (?:(?:[1-3]|I{1,3})\s*)?    # Optional Arabic or Roman numeral between 1 and 3
    [A-Za-z]+                   # One or more letters (the book name)
)\.?                            # Followed by an optional dot
(?:
    \s*(\d+)                    # Match the chapter number
    (?:
        [:.](\d+)               # Match the verse number
        (?:-(\d+))?             # Match the optional ending verse number
    )                           # <-- no '?' here: the verse is required
)
(?:
    \s+
    (?:
        (?:from\s+)|            # Match the keyword 'from' or 'in'
        (?:in\s+)|
        (?P<lbrace>\()          # or a word between (...)
    )\s*(\w+)
    (?(lbrace)\))
)?
''', re.X | re.U | re.I)  # re.I replaces the mid-pattern (?i), an error since Python 3.11

(If you're going to write a gigantic regex like this, please use the verbose flag: re.X in Python, the equivalent of /x in other languages.)

If you really need overlapping matches, you could use a lookahead. A simple example is

>>> rx = re.compile('(.)(?=(.))')
>>> x = rx.finditer("abcdefgh")
>>> [y.groups() for y in x]
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f'), ('f', 'g'), ('g', 'h')]

You may extend this idea to your RegEx.
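
One way the idea might be extended, sketched here with a hypothetical word-pair pattern rather than the full reference regex: consume only the part that should not be shared, and put the part that may overlap inside a lookahead with its own capture group.

```python
import re

# Hypothetical example: every pair of adjacent words, overlapping by one word.
text = "one two three four"

# The first word is consumed; the second is only looked at,
# so it is still available to start the next match.
rx = re.compile(r"(\w+)(?=\s+(\w+))")

pairs = [m.groups() for m in rx.finditer(text)]
print(pairs)
```

Each word except the first and last appears twice, once as the looked-ahead half of one match and once as the consumed half of the next.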

KennyTM 2010-06-12 06:49:06

Thanks, this is really useful, I'll update my question with the re.X formatted regex.

Raphink 2010-06-12 07:08:36

The lookahead solution fixes my issue, but then I can't get group(0) anymore. I put the lookahead around the whole block that follows the book name (around everything that's after '\.?'). Now when I try to match things like 'jn 3:16', I get:

>>> REF_REGEX.search("jn 3:16").groups()
(None, 'jn', '3', '16', None, None, None)
>>> REF_REGEX.search("jn 3:16").group(0)
'jn'

I don't understand why group(0) doesn't return the whole matched string.

Raphink 2010-06-12 08:46:34

@Rap: The lookahead part isn't counted towards the match. You need to use `foo(?=(bar(etc)))` and use `group(0) + group(1)` if you need the entire match.

KennyTM 2010-06-12 09:23:57

Thanks. I ended up giving it a name, in my case something like `(?=(?P<notbook>...))`, and using `group(0)+group('notbook')`. I also had to recalculate the end as `match.end()+len(match.group('notbook'))`. Thank you for your precious help :-)
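
The workaround described in this comment could look roughly like the sketch below. The pattern is a simplified stand-in, not the full reference regex; the group name `notbook` comes from the comment, while `book`, `full_text` and `full_end` are illustrative names.

```python
import re

# Simplified stand-in for the reference regex: a book name, then the
# chapter:verse part inside a named lookahead so it is not consumed.
rx = re.compile(r"(?P<book>[A-Za-z]+)(?=(?P<notbook>\s*\d+:\d+))")

m = rx.search("see jn 3:16 today")

# group(0) covers only the consumed part (the book name),
# so the looked-ahead text has to be appended by hand.
full_text = m.group(0) + m.group("notbook")

# m.end() stops right after the book name; adding the length of the
# looked-ahead text gives the end of the whole reference.
full_end = m.end() + len(m.group("notbook"))

print(full_text, m.start(), full_end)
```

Because the lookahead starts exactly where the consumed part ends, `m.end("notbook")` should give the same position as `full_end` without the manual arithmetic.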

Raphink 2010-06-12 13:49:57
