I'd like to create a (PCRE) regular expression to match all commonly used numbered lists, and I'd like to share my thoughts and gather input on ways to do this.

I've defined 'lists' as the set of canonical Anglo-Saxon conventions, i.e.

Numbers

1 2 3
1. 2. 3.
1) 2) 3)
(1) (2) (3)
1.1 1.2 1.2.1
1.1. 1.2. 1.3.
1.1) 1.2) 1.3)
(1.1) (1.2) (1.3)

Letters

a b c
a. b. c.
a) b) c)
(a) (b) (c) 
A B C
A. B. C. 
A) B) C)
(A) (B) (C)

Roman numerals

i ii iii
i. ii. iii.
i) ii) iii)
(i) (ii) (iii)
I II III
I. II. III.
I) II) III)
(I) (II) (III)

I'd like to know how complete this set of lists is, whether there are other numbering conventions that should be in there, and whether any of these ought to be removed.

Here's a regular expression I've created to solve this problem (in Python):

import re

numex = (r'(?:\d{1,3}'          # 1, 2, 3
         r'(?:\.\d{1,3}){0,4}'  # 1.1, 1.1.1.1
         r'|[A-Z]{1,2}'         # A. B. C.
         r'|[ivxcl]{1,6})')     # i, iii, ...

rex = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I) # re.U?

rex.match("123. Some paragraph")

I'd like to know how adequate this regex is for this problem, and whether there are alternative (regex or otherwise) solutions.

Incidentally, for my particular use-case, I wouldn't expect list numbers of more than 25-50.
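
Given that ceiling, one possible tightening is to enumerate the Roman numerals explicitly rather than accept any run of the letters `ivxcl`, which also admits ordinary words such as "civil". The following is only a sketch: the helper `to_roman`, the constant `LIMIT` and the names `strict_roman` and `loose_roman` are made up for illustration.

```python
import re

LIMIT = 50  # assumed ceiling, taken from the "25-50" remark

def to_roman(n):
    """Return the lowercase Roman numeral for 1 <= n <= 89."""
    pairs = [(50, 'l'), (40, 'xl'), (10, 'x'), (9, 'ix'),
             (5, 'v'), (4, 'iv'), (1, 'i')]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return ''.join(out)

# Longest numerals first, so that "i" does not shadow "iii" when the
# alternation is used with match() rather than fullmatch().
numerals = sorted((to_roman(n) for n in range(1, LIMIT + 1)),
                  key=len, reverse=True)

strict_roman = re.compile(r'(?:%s)' % '|'.join(numerals), re.I)
loose_roman = re.compile(r'[ivxcl]{1,6}', re.I)

for token in ("xiv", "XLIX", "civil", "lllll"):
    print(token,
          bool(strict_roman.fullmatch(token)),
          bool(loose_roman.fullmatch(token)))
```

The enumerated alternation rejects "civil" and "lllll", which the character-class version accepts, at the cost of a longer pattern.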

Thank you for reading.

Brian

+1  A: 

I'd change at least one thing, and that is to add word boundary anchors around your regex, otherwise it will match every single letter in any text:

rex = re.compile(r'(\(?\b%s\)|\b%s\b\.?)' % (numex, numex), re.I|re.M)

This helps a little, but of course any one- or two-letter word will still be matched.
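
To make the difference concrete, here is a small comparison of the pattern with and without the word boundaries. It is a sketch only: `numex` is written with balanced parentheses, and the helper `marker` and the names `plain` and `bounded` are invented for the example.

```python
import re

numex = (r'(?:\d{1,3}'          # 1, 2, 3
         r'(?:\.\d{1,3}){0,4}'  # 1.1, 1.1.1.1
         r'|[A-Z]{1,2}'         # A. B. C.
         r'|[ivxcl]{1,6})')     # i, iii, ...

# Without word boundaries.
plain = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)
# With word boundaries.
bounded = re.compile(r'(\(?\b%s\)|\b%s\b\.?)' % (numex, numex), re.I | re.M)

def marker(rex, text):
    """Return the list marker matched at the start of text, or None."""
    m = rex.match(text)
    return m.group(1) if m else None

for text in ("123. Some paragraph", "(iv) item",
             "Some paragraph", "It was a dark and stormy night"):
    print(repr(text), marker(plain, text), marker(bounded, text))
```

The unbounded pattern reports "So" for "Some paragraph"; the bounded one rejects that line but still reports "It" for the last sample.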

You might want to anchor the search at the start of the line; after all, these characters should be the first thing on the line (except maybe whitespace). A negative lookbehind won't work in Python because Python's re module doesn't support variable-length lookbehind, so you could add this outside the matching parentheses:

rex = re.compile(r'^\s*(\(?%s\)|%s\b\.?)' % (numex, numex), re.I|re.M)

Of course, now you must look at the match object's group(1) to only get the actual match and not the leading whitespace.
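
A short illustration of the group(0) versus group(1) distinction, and of how re.M lets the anchored pattern pick up markers on every line of a block of text. This is a sketch: `numex` is written with balanced parentheses, and the sample text and the name `markers` are invented.

```python
import re

numex = (r'(?:\d{1,3}'          # 1, 2, 3
         r'(?:\.\d{1,3}){0,4}'  # 1.1, 1.1.1.1
         r'|[A-Z]{1,2}'         # A. B. C.
         r'|[ivxcl]{1,6})')     # i, iii, ...

rex = re.compile(r'^\s*(\(?%s\)|%s\b\.?)' % (numex, numex), re.I | re.M)

m = rex.match("   2) Second")
print(repr(m.group(0)))  # whole match, including the leading whitespace
print(repr(m.group(1)))  # just the list marker

text = "  1. First\n  (b) Second\nplain text xyz\n iii) Third"
# With re.M, "^" matches at the start of every line, so finditer
# walks the text line by line.
markers = [m.group(1) for m in rex.finditer(text)]
print(markers)
```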

You will still match too much (e.g. sentences starting with "I thought so" or "It was a dark and stormy night"), but your rules allow this, and I think you're aware of it.

Tim Pietzcker
@Tim: Thanks for the post. I just noticed this problem with alpha false positives. I think I've resolved it by (1) simplifying by reducing character lists to one letter, and (2) with a negative zero-width lookahead.
Brian M. Hunt
A: 

Here's a Wikified solution:

 numex = r"""(?:               # no ^ here: rex anchors the line itself
       \d{1,3}                 # 1, 2, 3
           (?:\.\d{1,3}){0,4}  # 1.1, 1.1.1.1
     | [B-H] | [J-Z]           # B - Z except I; caps at 26
     | [AI](?!\s)              # Note: "A" and "I" can properly start non-lists
     | [a-z]                   # a - z
     | [ivxcl]{1,6}            # Roman ii, etc.
     | [IVXCL]{1,6}            # Roman IV, etc.
     )
     """

 rex = re.compile(r'^\s*(\(?%s\)|%s\.?)\s+(.*)'
   % (numex, numex), re.X)

Additions, changes and suggestions most welcome.
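
As a quick way to probe the pattern, here is a small harness. It is a sketch under two assumptions: `numex` carries no `^` of its own, so that the outer `^\s*` does the anchoring, and the uppercase Roman alternative is made a parameter so two variants can be compared. The names `NUMEX_TEMPLATE`, `make_rex`, `split_item`, `loose` and `tight` are invented for the example.

```python
import re

NUMEX_TEMPLATE = r"""(?:
      \d{1,3}(?:\.\d{1,3}){0,4}   # 1, 1.1, 1.1.1.1
    | [B-H] | [J-Z]               # single capitals except A and I
    | [AI](?!\s)                  # "A"/"I" only when no whitespace follows
    | [a-z]                       # a - z
    | [ivxcl]{1,6}                # lowercase Roman
    | %s                          # uppercase Roman, supplied by make_rex
    )
    """

def make_rex(upper_roman):
    """Build the line matcher with the given uppercase Roman sub-pattern."""
    numex = NUMEX_TEMPLATE % upper_roman
    return re.compile(r'^\s*(\(?%s\)|%s\.?)\s+(.*)' % (numex, numex), re.X)

def split_item(rex, line):
    """Return (marker, rest of line), or None if the line is not a list item."""
    m = rex.match(line)
    return (m.group(1), m.group(2)) if m else None

loose = make_rex(r'[IVXCL]{1,6}')  # as posted
tight = make_rex(r'[IVXCL]{2,6}')  # hypothetical tweak

for line in ("1.2.1 Scope", "  (iv) Fourth", "A quick note",
             "I thought so", "I. One", "IV. Four"):
    print(repr(line), split_item(loose, line), split_item(tight, line))
```

With the pattern as posted, "I thought so" still comes back as a list item, because a bare "I" that fails the `[AI](?!\s)` test is picked up again by `[IVXCL]{1,6}`. Requiring at least two letters in the uppercase Roman alternative closes that gap in this harness while still accepting "I." and "IV.", since a single I, V, X, C or L is already covered by the earlier alternatives.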

Brian M. Hunt