Hi

I am trying to learn more about regular expressions. I have one below that I believe finds cases where there is a missing close paren on a number up to 999 billion. I thought the one below it should do the same, but I do not get similar results.

   missingParenReg=re.compile(r"^\([$]*[0-9]{1,3}[,]?[0-9]{0,3}[,]?[0-9]{0,3}[,]?[0-9]{0,3}[.]*[0-9]*[^)]$")
   missingParenReg2=re.compile(r"^\([$]?([0-9]{1,3}[,]?)+[.]*[0-9]*[^)]$")

I think the second one says:
There must be an open paren to start
There may or may not be a dollar sign (at most one)
The next group must exist at least once but can exist an unlimited number of times
The group should have at least one digit but may have as many as three
The group may have zero or one comma
Following this group there may or may not be a decimal point
If there is a decimal point, it will be followed by zero or more digits
At the end there should not be a closing paren.

I am trying to understand this magic stuff so I would appreciate a correction to my regex (if it can be corrected) in addition to a more elegant solution if you have it.

+4  A: 

Are there nested parentheses (your regexps assume there are not)? If not:

# startswith/endswith also handle the empty string, where astring[0] would raise IndexError
whether_paren_is_missing = astring.startswith('(') and not astring.endswith(')')

To validate a dollar amount part:

import re

cents = r"(?:\.\d\d)" # cents 
re_dollar_amount = re.compile(r"""(?x)
    ^               # match at the very beginning of the string
    \$?             # optional dollar sign
    (?:               # followed by
        (?:             # integer part  
        0               # zero
        |               # or
        [1-9]\d{,2}     # 1 to 3 digits (no leading zero) 
        (?:               # followed by
            (?:,\d{3})*     # zero or more three-digit groups with commas
            |               # or
            \d*             # zero or more digits without commas (no leading zero)
            )
        )
        (?:\.|%(cents)s)?   # optional f.p. part 
    |               # or
    %(cents)s       # pure f.p. '$.01'
    )
    $               # match end of string
    """ % vars())

Allow:

    $0
    0
    $234
    22
    $0.01
    10000.12
    $99.90
    2,010,123
    1.00
    2,103.45
    $.10
    $1.

Forbid:

    01234
    00
    123.4X
    1.001
    .
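
The two lists above can be checked mechanically. The following is a minimal sketch, assuming a compact one-line restatement of the verbose pattern (the names `CENTS`, `DOLLAR_AMOUNT` and `is_dollar_amount` are illustrative, not part of the answer's code):

```python
import re

# Compact restatement of the verbose pattern (assumed equivalent to it).
CENTS = r"(?:\.\d\d)"
DOLLAR_AMOUNT = re.compile(
    r"^\$?(?:(?:0|[1-9]\d{0,2}(?:(?:,\d{3})*|\d*))(?:\.|" + CENTS + r")?|" + CENTS + r")$"
)

ALLOW = ["$0", "0", "$234", "22", "$0.01", "10000.12", "$99.90",
         "2,010,123", "1.00", "2,103.45", "$.10", "$1."]
FORBID = ["01234", "00", "123.4X", "1.001", "."]

def is_dollar_amount(text):
    """Return True if the whole string is a valid dollar amount."""
    return DOLLAR_AMOUNT.match(text) is not None

for sample in ALLOW + FORBID:
    print(repr(sample), is_dollar_amount(sample))
```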
J.F. Sebastian
Note to the original poster: The (?x) at the start sets the re.X flag, also known as the re.VERBOSE flag, which makes Python ignore whitespace and comments inside the pattern. This allows you to write regular expressions that are much easier to follow, as in this example.
John Fouhy
Thanks, this was very helpful and extended my understanding of regular expressions. I marked Karwin's solution as the accepted answer because his looks (to my naive eye) more efficient. But thanks for your contribution.
PyNEwbie
A: 

One difference I see at a glance is that your regex will not find strings like:

(123,,,

That's because the corrected version requires at least one digit between commas. (A reasonable requirement, I'd say.)
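
A quick way to see the difference is to run both patterns from the question against that string. This sketch reads "your regex" as the second pattern, which is an assumption; the variable names are illustrative:

```python
import re

# Both patterns copied from the question.
first = re.compile(r"^\([$]*[0-9]{1,3}[,]?[0-9]{0,3}[,]?[0-9]{0,3}[,]?[0-9]{0,3}[.]*[0-9]*[^)]$")
second = re.compile(r"^\([$]?([0-9]{1,3}[,]?)+[.]*[0-9]*[^)]$")

sample = "(123,,,"
first_hit = bool(first.match(sample))    # commas are optional separators here
second_hit = bool(second.match(sample))  # every comma must follow a digit group

print(first_hit, second_hit)
```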

Jon Ericson
You make a good point, except I have no control over what exists; I need to be able to pull whatever is reported. My fault for not explaining in more detail: the need is driven by the fact that in many tables the closing paren is in the next cell. This keeps me from having to look ahead.
PyNEwbie
+3  A: 

The trickier part about regular expressions isn't making them accept valid input, it's making them reject invalid input. For example, the second expression accepts input that is clearly wrong, including:

  • (1,2,3,4 -- one digit between each comma
  • (12,34,56 -- two digits between each comma
  • (1234......5 -- unlimited number of decimal points
  • (1234,.5 -- comma before decimal point
  • (123,456789,012 -- if there are some commas, they should be between each triple
  • (01234 -- leading zero is not conventional
  • (123.4X -- last char is not a closing paren
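
These cases can be confirmed by feeding them to the second pattern from the question. A small sketch (the list name `questionable` is illustrative):

```python
import re

# Second pattern, copied from the question.
second = re.compile(r"^\([$]?([0-9]{1,3}[,]?)+[.]*[0-9]*[^)]$")

questionable = ["(1,2,3,4", "(12,34,56", "(1234......5", "(1234,.5",
                "(123,456789,012", "(01234", "(123.4X"]

# Collect every input the pattern accepts.
accepted = [s for s in questionable if second.match(s)]
print(accepted)
```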

Here's an alternative regular expression that should reject the examples above:

[-+]?[$]?(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})*)(\.\d+)?

  • Optional leading plus/minus.
  • Optional dollar sign.
  • Three choices separated by |:
    • Single zero digit (for numbers like 0.5 or simply 0).
    • Any number of digits with no commas. The first digit must not be zero.
    • Comma-separated digits. The first digit must not be zero. Up to three digits before the first comma. Each comma must be followed by exactly three digits.
  • Optional single decimal point, which must be followed by one or more digits.
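
As a rough check, the expression can be tried in Python. This sketch assumes `re.fullmatch` is used so that the whole string must match (the expression itself has no anchors); `NUMBER` and `is_number` are illustrative names:

```python
import re

NUMBER = re.compile(r"[-+]?[$]?(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})*)(\.\d+)?")

def is_number(text):
    """Return True only if the entire string matches the expression."""
    return NUMBER.fullmatch(text) is not None

good = ["0", "0.5", "1234", "1,234,567.89", "-$12", "+42"]
# The problem inputs listed earlier, without the leading paren.
bad = ["1,2,3,4", "12,34,56", "1234......5", "1234,.5",
       "123,456789,012", "01234", "123.4X"]

for sample in good + bad:
    print(repr(sample), is_number(sample))
```

With an unanchored `re.match`, the second alternative would stop at the first comma of an input like 1,234, so anchoring (or `fullmatch`) matters here.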

Regarding the parens, if all you care about is whether the parens are balanced, then you can skip parsing the numeric format precisely; just trust that any combination of digits, dollar signs, decimal points, and commas between the parens is valid. Then use the (?!...) negative lookahead construct, which succeeds at a position only if the input there doesn't match the regular expression inside.

(?!\([$\d.,]+\))
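
One way to apply this to the missing-paren problem is sketched below. It is an adaptation, not the expression exactly as written above: it assumes the string must start with an opening paren, and moves that paren outside the lookahead so that strings with no parens at all are not reported. `MISSING_CLOSE` and `missing_close_paren` are illustrative names:

```python
import re

# Assumption: require "(" first, then use the negative lookahead to
# reject strings where the number is followed by a closing paren.
MISSING_CLOSE = re.compile(r"^\((?![$\d.,]+\))")

def missing_close_paren(text):
    """Return True if text opens a paren that is not closed after the number."""
    return MISSING_CLOSE.match(text) is not None

for sample in ["(1,234", "($5.00", "(1,234)", "(12)", "1,234"]:
    print(repr(sample), missing_close_paren(sample))
```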

Bill Karwin
Thanks- all I need is to find if a paren was missing so I was going down the hard path. I learned a lot from your answer and I appreciate the time you took to supply it.
PyNEwbie
A: 

I've found it very helpful to use Kiki when tailoring a regex. It shows you visually what's going on with your regexes. It is a huge time-saver.

Sergio