ansaurus

Question

Need Help Understanding how to use less complex regex in Python

Answer 1

+4 A:

Are there nested parentheses (your regexps assume there are not)? If not:

whether_paren_is_missing = astring.startswith('(') and not astring.endswith(')')

To validate a dollar amount part:

import re

cents = r"(?:\.\d\d)" # cents 
re_dollar_amount = re.compile(r"""(?x)
    ^               # match at the very beginning of the string
    \$?             # optional dollar sign
    (?:               # followed by
        (?:             # integer part  
        0               # zero
        |               # or
        [1-9]\d{,2}     # 1 to 3 digits (no leading zero) 
        (?:               # followed by
            (?:,\d{3})*     # zero or more three-digits groups with commas                          
            |               # or
            \d*             # zero or more digits without commas (no leading zero)
            )
        )
        (?:\.|%(cents)s)?   # optional f.p. part 
    |               # or
    %(cents)s       # pure f.p. '$.01'
    )
    $               # match end of string
    """ % vars())

Allow:

Forbid:
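A quick way to see what the pattern allows and forbids is to run it over a few strings. The sketch below restates the same expression in a more compact layout and passes the substitution as an explicit dict instead of vars(). The names CENTS, RE_DOLLAR_AMOUNT and is_dollar_amount, and all the sample strings, are my own illustrative picks, not the answer's original lists.

```python
import re

CENTS = r"(?:\.\d\d)"  # exactly two digits after the decimal point

# Same expression as above, laid out more compactly; the substitution
# dict is spelled out instead of relying on vars().
RE_DOLLAR_AMOUNT = re.compile(r"""(?x)
    ^ \$?                       # optional dollar sign
    (?:
        (?: 0 | [1-9]\d{,2} (?: (?:,\d{3})* | \d* ) )   # integer part
        (?: \. | %(cents)s )?   # optional fractional part
      |
        %(cents)s               # cents only, e.g. '$.01'
    )
    $
    """ % {"cents": CENTS})


def is_dollar_amount(text):
    """Return True if the whole string is a well-formed dollar amount."""
    return RE_DOLLAR_AMOUNT.match(text) is not None


# Illustrative inputs (assumptions, chosen by hand).
accepted = ["0", "$0", "1", "$1,234.56", "1234567", "$.01", "12."]
rejected = ["01", "1,23", "1.234", "1234,567", "$", ""]

for s in accepted + rejected:
    print(repr(s), is_dollar_amount(s))
```

Note that a bare trailing dot ("12.") is accepted because the fractional part is written as either a lone dot or a dot plus two digits.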

J.F. Sebastian 2008-12-11 23:57:50

Note to the original poster: The (?x) at the start sets the re.X flag, also known as the re.VERBOSE flag, which makes Python ignore whitespace and comments inside the pattern. This allows you to write regular expressions that are much easier to follow, as in this example.
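To make the equivalence concrete, here is a small sketch that writes one toy pattern three ways: compact, with the re.VERBOSE flag passed to re.compile, and with the inline (?x) flag. The pattern itself (three digits, a hyphen, four digits) is just an assumption for illustration and has nothing to do with the dollar amounts above.

```python
import re

# Compact form: hard to annotate.
compact = re.compile(r"^\d{3}-\d{4}$")

# Same pattern with the flag passed as an argument.
verbose = re.compile(r"""
    ^ \d{3}   # three digits
    -         # a literal hyphen
    \d{4} $   # four digits, then end of string
    """, re.VERBOSE)

# Same pattern with the flag set inline; (?x) must come first.
inline = re.compile(r"""(?x)
    ^ \d{3} - \d{4} $   # whitespace and comments are ignored
    """)

for pattern in (compact, verbose, inline):
    print(bool(pattern.match("555-1234")), bool(pattern.match("555 1234")))
```

One thing to watch in verbose mode: a literal space or # in the pattern has to be escaped or put inside a character class, otherwise it is dropped.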

John Fouhy 2008-12-12 01:04:12

Thanks, this was very helpful and extended my understanding of regular expressions. I marked Karwin's solution as the accepted answer because his looks (to my naive eye) more efficient. But thanks for your contribution.

PyNEwbie 2008-12-12 20:54:37

Answer 2

A:

One difference I see at a glance is that your regex will not find strings like:

(123,,,

That's because the corrected version requires at least one digit between commas. (A reasonable requirement, I'd say.)
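The question's own expressions aren't reproduced here, so the two patterns below are hypothetical stand-ins that only illustrate the difference: one accepts any run of digits and commas after the opening paren, the other requires at least one digit between commas.

```python
import re

# Hypothetical stand-ins, not the patterns from the question.
permissive = re.compile(r"\([\d,.]+")       # any mix of digits, commas, dots
strict = re.compile(r"\(\d+(?:,\d+)*")      # at least one digit between commas

for s in ["(123,,,", "(123,456"]:
    print(s,
          permissive.fullmatch(s) is not None,
          strict.fullmatch(s) is not None)
```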

Jon Ericson 2008-12-12 00:02:56

You make a good point, but I have no control over what exists; I need to be able to pull whatever is reported. It's my fault for not explaining in more detail: the need is driven by the fact that in many tables the closing paren is in the next cell. This keeps me from having to look ahead.

PyNEwbie 2008-12-12 20:50:50

Answer 3

+3 A:

The trickier part about regular expressions isn't making them accept valid input, it's making them reject invalid input. For example, the second expression accepts input that is clearly wrong, including:

(1,2,3,4 -- one digit between each comma
(12,34,56 -- two digits between each comma
(1234......5 -- unlimited number of decimal points
(1234,.5 -- comma before decimal point
(123,456789,012 -- if there are some commas, they should be between each triple
(01234 -- leading zero is not conventional
(123.4X -- last char is not a closing paren

Here's an alternative regular expression that should reject the examples above:

[-+]?[$]?(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})*)(\.\d+)?

Optional leading plus/minus.
Optional dollar sign.
Three choices separated by |:
- Single zero digit (for numbers like 0.5 or simply 0).
- Any number of digits with no commas. The first digit must not be zero.
- Comma-separated digits. The first digit must not be zero. Up to three digits before the first comma. Each comma must be followed by exactly three digits.
Optional single decimal point, which must be followed by one or more digits.
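A minimal sketch for checking this against the examples above. It assumes the leading paren has already been stripped (the expression itself doesn't match one) and uses re.fullmatch so the whole string has to match; NUMBER, is_valid_number and the "good" samples are names and inputs I made up for the test.

```python
import re

NUMBER = re.compile(r"[-+]?[$]?(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})*)(\.\d+)?")


def is_valid_number(text):
    """True only if the entire string matches the expression."""
    return NUMBER.fullmatch(text) is not None


# The invalid examples from the list above, opening paren included.
bad = ["(1,2,3,4", "(12,34,56", "(1234......5", "(1234,.5",
       "(123,456789,012", "(01234", "(123.4X"]
# Illustrative valid inputs (assumptions).
good = ["0", "0.5", "-$1,234.56", "+1234567", "$12,345"]

for s in bad:
    print(s, is_valid_number(s[1:]))   # drop the leading paren first
for s in good:
    print(s, is_valid_number(s))
```

Anchoring matters here: because the no-commas alternative comes before the comma-separated one, an unanchored NUMBER.match("1,234") stops after the "1". With fullmatch (or ^...$ around the expression) the engine backtracks into the comma alternative and the full string is checked.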

Regarding the parens, if all you care about is whether the parens are balanced, then you can skip parsing the numeric format precisely; just trust that any combination of digits, decimal points, and commas between the parens is valid. Then use the negative lookahead construct (?!...), which succeeds only if the input at that position doesn't match the regular expression inside.

(?!\([$\d.,]+\))
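One way this could be used to flag a missing closing paren, as a sketch: the lookahead rules out a properly closed value, and the rest of the pattern requires an opening paren. The anchors and the trailing \( are my additions, and unbalanced and missing_close_paren are hypothetical names.

```python
import re

# Negative lookahead: fail if the string is a fully parenthesized value,
# then require that it does start with an opening paren.
unbalanced = re.compile(r"^(?!\([$\d.,]+\)$)\(")


def missing_close_paren(text):
    """True if text opens with '(' but is not closed as '(digits...)'."""
    return unbalanced.match(text) is not None


for s in ["(1,234.56", "(1,234.56)", "1,234.56", "(123,,,"]:
    print(s, missing_close_paren(s))
```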

Bill Karwin 2008-12-12 00:14:10

Thanks. All I need is to find whether a paren is missing, so I was going down the hard path. I learned a lot from your answer and I appreciate the time you took to supply it.

PyNEwbie 2008-12-12 20:52:19

Answer 4

A:

I've found it very helpful to use Kiki when tailoring a regex. It shows you visually what's going on with your regexes. It is a huge time-saver.

Sergio 2009-05-15 06:16:29
