Hi there,

I'm using a regex to match Bible verse references in a text. The current regex is

REF_REGEX = re.compile(r'''
  (?<!\w)                        # Not preceded by a word character
  (?P<quote>q(?:uote)?\s+)?      # Match optional 'q' or 'quote' followed by whitespace
  (?P<book>                           
    (?:(?:[1-3]|I{1,3})\s*)?     # Match an optional arabic or roman number between 1 and 3.
    [A-Za-z]+                    # Match any alphabetics
  )\.?                           # Followed by an optional dot
  (?:                         
    \s*(?P<chapter>\d+)          # Match the chapter number
    (?:
      [:\.](?P<startverse>\d+)   # Match the starting verse number, preceded by ':' or '.'
        (?:-(?P<endverse>\d+))?  # Match the optional ending verse number, preceded by '-'
    )?                           # Verse numbers are optional
  )
  (?:
    \s+(?:                       # Here be spaces
      (?:from\s+)|(?:in\s+)|(?P<lbrace>\())   # Match 'from[:space:]', 'in[:space:]' or '('
      \s*(?P<version>\w+)        # Match a word preceded by optional spaces
      (?(lbrace)\))              # Close the '(' if found earlier
  )?                             # The whole 'in|from|()' is optional
  ''', re.IGNORECASE | re.VERBOSE | re.UNICODE)

This matches the following expressions fine:

"jn 3:16":                           (None, 'jn', '3', '16', None, None, None),
"matt. 18:21-22":                    (None, 'matt', '18', '21', '22', None, None),
"q matt. 18:21-22":                  ('q ', 'matt', '18', '21', '22', None, None),
"QuOTe jn 3:16":                     ('QuOTe ', 'jn', '3', '16', None, None, None),
"q 1co13:1":                         ('q ', '1co', '13', '1', None, None, None), 
"q 1 co 13:1":                       ('q ', '1 co', '13', '1', None, None, None),
"quote 1 co 13:1":                   ('quote ', '1 co', '13', '1', None, None, None),
"quote 1co13:1":                     ('quote ', '1co', '13', '1', None, None, None),
"jean 3:18 (PDV)":                   (None, 'jean', '3', '18', None, '(', 'PDV'),
"quote malachie 1.1-2 fRom Colombe": ('quote ', 'malachie', '1', '1', '2', None, 'Colombe'),
"quote malachie 1.1-2 In Colombe":   ('quote ', 'malachie', '1', '1', '2', None, 'Colombe'),
"cinq jn 3:16 (test)":               (None, 'jn', '3', '16', None, '(', 'test'),
"Q   IIKings5.13-58   from   wolof": ('Q   ', 'IIKings', '5', '13', '58', None, 'wolof'),
"This text is about lv5.4-6 in KJV only": (None, 'lv', '5', '4', '6', None, 'KJV'),

but it fails to parse:

"Found in 2 Cor. 5:18-21 ( Ministers":                    (None, '2 Cor', '5', '18', '21', None, None),

because it returns (None, 'in', '2', None, None, None, None) instead.

Is there a way to get finditer() to return all matches, even if they overlap, or is there a way to improve my regex so it matches this last bit properly?

Thanks.

+2  A: 

Once a character has been consumed by a match, it stays consumed: finditer() only returns non-overlapping matches, so you cannot ask the regex engine to go back and reuse it.
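To illustrate, here is a minimal sketch using a cut-down pattern (SIMPLE_REF is a hypothetical simplification, not the regex from the question). Because "in 2" is matched first, the "2" is no longer available as the start of "2 Cor":

```python
import re

# Hypothetical cut-down version of the reference regex: book, chapter, optional verse.
SIMPLE_REF = re.compile(r'(?<!\w)((?:[1-3]\s*)?[A-Za-z]+)\.?\s*(\d+)(?:[:.](\d+))?')

# finditer() resumes scanning where the previous match ended,
# so the "2" consumed by "in 2" cannot start the next match.
matches = [m.group(0) for m in SIMPLE_REF.finditer("in 2 Cor. 5:18")]
print(matches)  # ['in 2', 'Cor. 5:18']
```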

From your examples, the verse part (e.g. :1) never seems to be omitted. If you make it mandatory by removing the trailing '?', the last example matches properly.

import re

ref_regex = re.compile(r'''
(?<!\w)                         # Not preceded by a word character
(q(?:uote)?\s+)?                # Match 'q' or 'quote' followed by whitespace
(
    (?:(?:[1-3]|I{1,3})\s*)?    # Match an arabic or roman number between 1 and 3.
    [A-Za-z]+                   # Match many alphabetics
)\.?                            # Followed by an optional dot
(?:
    \s*(\d+)                    # Match the chapter number
    (?:
        [:.](\d+)               # Match the verse number
        (?:-(\d+))?             # Match the ending verse number
    )                           # <-- no '?' here
)
(?:
    \s+
    (?:
        (?:from\s+)|            # Match the keyword 'from' or 'in'
        (?:in\s+)|
        (?P<lbrace>\()          # or stuff between (...)
    )\s*(\w+)
    (?(lbrace)\))
)?
''', re.I | re.X | re.U)  # re.I replaces the mid-pattern (?i), which newer Python versions reject

(If you're going to write a gigantic RegEx like this, please use the /x flag, i.e. re.X or re.VERBOSE in Python.)
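As a quick check, here is a sketch of the same idea using the named groups from the question. It assumes re.IGNORECASE is passed as a flag rather than inline, and compacts the layout. Note the trade-off: chapter-only references such as "jn 3" no longer match.

```python
import re

# Variant of the pattern with a mandatory verse; group names follow the question.
REF_REGEX = re.compile(r'''
    (?<!\w)
    (?P<quote>q(?:uote)?\s+)?
    (?P<book>(?:(?:[1-3]|I{1,3})\s*)?[A-Za-z]+)\.?
    \s*(?P<chapter>\d+)
    [:.](?P<startverse>\d+)          # verse is mandatory here
    (?:-(?P<endverse>\d+))?
    (?:
        \s+(?:from\s+|in\s+|(?P<lbrace>\())
        \s*(?P<version>\w+)
        (?(lbrace)\))
    )?
''', re.IGNORECASE | re.VERBOSE)

m = REF_REGEX.search("Found in 2 Cor. 5:18-21 ( Ministers")
fields = m.group('book', 'chapter', 'startverse', 'endverse')
print(fields)  # ('2 Cor', '5', '18', '21')

# Chapter-only input is rejected now that the verse is required.
chapter_only = REF_REGEX.search("jn 3")
print(chapter_only)  # None
```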


If you really need overlapping matches, you could use a lookahead. A simple example is

>>> rx = re.compile('(.)(?=(.))')
>>> x = rx.finditer("abcdefgh")
>>> [y.groups() for y in x]
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f'), ('f', 'g'), ('g', 'h')]

You may extend this idea to your RegEx.
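One way to extend it, as a rough sketch: wrap a whole (simplified, hypothetical) reference pattern in a capturing lookahead so that nothing is consumed and every start position is tried. PATTERN and OVERLAPPING are illustrative names, not part of the question's code. Be aware that this also yields shorter candidates such as "Cor. 5:18-21", which you would have to filter out yourself.

```python
import re

# Hypothetical simplified reference pattern (no quote/version parts).
PATTERN = r'(?:[1-3]\s*)?[A-Za-z]+\.?\s*\d+(?:[:.]\d+(?:-\d+)?)?'

# The whole pattern sits inside a lookahead: each match is zero-width,
# so finditer() retries at every position and overlaps are reported.
OVERLAPPING = re.compile(r'(?<!\w)(?=(' + PATTERN + r'))')

candidates = [m.group(1) for m in OVERLAPPING.finditer("in 2 Cor. 5:18-21")]
print(candidates)  # ['in 2', '2 Cor. 5:18-21', 'Cor. 5:18-21']
```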

KennyTM
Thanks, this is really useful, I'll update my question with the re.X formatted regex.
Raphink
The lookahead solution fixes my issue, but then I can't get group(0) anymore. I put the lookahead around the whole block that follows the book name (around everything that's after '\.?'). Now when I try to match things like 'jn 3:16', I get:
>>> REF_REGEX.search("jn 3:16").groups()
(None, 'jn', '3', '16', None, None, None)
>>> REF_REGEX.search("jn 3:16").group(0)
'jn'
I don't understand why group(0) doesn't return the whole matched string.
Raphink
@Rap: The lookahead part isn't counted towards the match. You need to use `foo(?=(bar(etc)))` and use `group(0) + group(1)` if you need the entire match.
KennyTM
Thanks. I ended up giving a name to it, in my case something like `(?=(?P<notbook>...))`, and using `group(0)+group('notbook')`. I also had to recalculate the end as `match.end()+len(match.group('notbook'))`. Thank you for your precious help :-)
Raphink
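For reference, the approach described in the last comment might look roughly like the sketch below. The pattern is a simplified stand-in for the real one, and full_matches is a hypothetical helper name. Only the book name is consumed; the rest is captured inside a named lookahead and glued back on afterwards.

```python
import re

# Simplified stand-in: the book (plus optional dot) is consumed,
# chapter and verses are only looked at, captured as 'notbook'.
REF = re.compile(r'''
    (?<!\w)
    (?P<book>(?:[1-3]\s*)?[A-Za-z]+)\.?
    (?=(?P<notbook>
        \s*(?P<chapter>\d+)
        (?:[:.](?P<startverse>\d+)(?:-(?P<endverse>\d+))?)?
    ))
''', re.VERBOSE)

def full_matches(text):
    """Yield (full_text, start, end) with the lookahead part added back."""
    for m in REF.finditer(text):
        rest = m.group('notbook')
        yield (m.group(0) + rest, m.start(), m.end() + len(rest))

text = "Found in 2 Cor. 5:18-21 ( Ministers"
results = list(full_matches(text))
print(results)  # [('in 2', 6, 10), ('2 Cor. 5:18-21', 9, 23)]
```

Both candidates are returned even though they overlap on the "2", so the caller can keep whichever one has a known book name.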