Goal: Given a number (it may be very long and is greater than 0), I'd like to get its five least significant digits, ignoring any trailing zeros.

I tried to solve this with a regex. With help from RegexBuddy I came up with this one:

[\d]+([\d]{0,4}+[1-9])0*

But Python can't compile that.

>>> import re
>>> re.compile(r"[\d]+([\d]{0,4}+[1-9])0*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.5/re.py", line 241, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

The problem is the "+" after "{0,4}"; it seems it doesn't work in Python (even in 2.6).

How can I write a working regex?

PS: I know you can keep dividing by 10 while the number ends in 0 and then take the remainder n%100000... but this is a question about regex.
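For comparison, here is a minimal sketch of the arithmetic approach mentioned in the PS. It assumes the input is already a positive int, and the function name is only illustrative:

```python
def last_five_significant(n):
    """Return the last five digits of n as an int, ignoring trailing zeros."""
    if n <= 0:
        raise ValueError("n must be greater than 0")
    # Drop trailing zeros one at a time.
    while n % 10 == 0:
        n //= 10
    # Keep at most the five lowest remaining digits.
    return n % 100000


print(last_five_significant(1234567000))  # prints 34567
```

Note that the result is an int, so leading zeros inside the five-digit window are lost: 1000023000 gives 23 here, whereas a regex working on the string form would capture "00023".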

+2  A: 

Small tip: I recommend you test with reTest instead of RegexBuddy. Different programming languages use different regular expression engines. ReTest is valuable in that it lets you quickly test regular expression strings within Python itself. That way you can ensure that you have tested your syntax with Python's own regular expression engine.

DoxaLogos
Ultimately, any regular expression you use has to be tested in your actual application, on your actual data. Running initial tests in a tool like RegexBuddy while your regex is under construction saves you time, as long as the tool is used properly (in this case, select Python in RegexBuddy's toolbar when using Python).
Jan Goyvaerts
A: 

The error seems to be that you have two quantifiers in a row, {0,4} and +. Unless + is meant to be a literal here (which I doubt, since you're talking about numbers), then I don't think you need it at all. Unless it means something different in this situation (possibly the greediness of the {} quantifier)? I would try

[\d]+([\d]{0,4}[1-9])0*

If you actually intended to have both quantifiers to be applied, then this might work

[\d]+(([\d]{0,4})+[1-9])0*

But given your specification of the problem, I doubt that's what you want.

Sean Nyman
The "+" after a quantifier means that it is possessive. Python does not support possessive quantifiers.
MizardX
+7  A: 

That regular expression contains a lot that is superfluous. Try this:

>>> import re
>>> re.compile(r"(\d{0,4}[1-9])0*$")

The above regular expression assumes that the number is valid (it will also match "abc0123450", for example.) If you really need the validation that there are no non-number characters, you may use this:

>>> import re
>>> re.compile(r"^\d*?(\d{0,4}[1-9])0*$")

Anyway, the \d does not need to be in a character class, and the quantifier {0,4} does not need to be made possessive (which is what the additional + specifies, although apparently Python does not recognize that).

Also, in the second regular expression, the leading \d*? is non-greedy, as I believe this will improve the performance and accuracy: it leaves as many digits as possible for the capturing group. I also made it "zero or more" as I assume that is what you want.

I also added anchors as this ensures that your regular expression won't match anything in the middle of a string. If this is what you desired though (maybe you're scanning a long text?), remove the anchors.
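A short sketch of how the two patterns above might be used. The helper names are made up for illustration, and the input is assumed to be the number in string form:

```python
import re

# Unvalidated variant: finds the digits at the end of any string.
TAIL = re.compile(r"(\d{0,4}[1-9])0*$")
# Validated variant: the whole string must consist of digits.
STRICT_TAIL = re.compile(r"^\d*?(\d{0,4}[1-9])0*$")


def last_five(number_string):
    """Return up to five trailing digits (ignoring trailing zeros), or None."""
    match = TAIL.search(number_string)
    return match.group(1) if match else None


def last_five_strict(number_string):
    """Same as last_five, but reject strings containing non-digits."""
    match = STRICT_TAIL.match(number_string)
    return match.group(1) if match else None


print(last_five_strict("1234567000"))  # prints 34567
print(last_five("abc0123450"))         # prints 12345
print(last_five_strict("abc0123450"))  # prints None
```

The result is in group 1 rather than in the whole match, because the trailing zeros (and, in the strict variant, the leading digits) are matched outside the capturing group.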

Blixt
The second one is what I needed. The first will match too many groups. Thanks!!!
Andrea Ambu
+2  A: 

\d{0,4}+ is a possessive quantifier supported by certain regular expression flavors such as .NET and Java. Python does not support possessive quantifiers.
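To illustrate what "possessive" means in practice, here is a rough sketch that emulates \d{0,4}+ in an engine without possessive quantifiers, using the common lookahead-plus-backreference idiom. This is a workaround, not something RegexBuddy generates, and it relies on lookahead being atomic in Python's re module. As far as I know, Python 3.11 and later also accept the {0,4}+ syntax directly, so the workaround mainly matters for older versions such as the ones discussed here.

```python
import re

# Plain greedy quantifier: may give digits back so that [1-9] can match.
greedy = re.compile(r"\d{0,4}[1-9]")

# Emulated possessive quantifier: the lookahead captures up to four digits,
# \1 then consumes exactly that capture, and the engine cannot backtrack
# into the lookahead to try a shorter run of digits.
emulated_possessive = re.compile(r"(?=(\d{0,4}))\1[1-9]")

print(greedy.match("1234") is not None)                # prints True
print(emulated_possessive.match("1234") is not None)   # prints False
print(emulated_possessive.match("12345") is not None)  # prints True
```

With "1234" the greedy version backtracks to three digits so that [1-9] can match the "4", while the possessive version keeps all four digits and fails. This is why removing the + can change what a regex matches.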

In RegexBuddy, select Python in the toolbar at the top, and RegexBuddy will tell you that Python doesn't support possessive quantifiers. The + will be highlighted in red in the regular expression, and the Create tab will indicate the error.

If you select Python on the Use tab in RegexBuddy, RegexBuddy will generate a Python source code snippet with a regular expression without the possessive quantifier, and a comment indicating that the removal of the possessive quantifier may yield different results. Here's the Python code that RegexBuddy generates using the regex from the question:

# Your regular expression could not be converted to the flavor required by this language:
# Python does not support possessive quantifiers

# Because of this, the code snippet below will not work as you intended, if at all.

reobj = re.compile(r"[\d]+([\d]{0,4}[1-9])0*")

What you probably did is select a flavor such as Java in the main toolbar, and then click Copy Regex as Python String. That will give you a Java regular expression formatted as a Python string. The items in the Copy menu do not convert your regular expression. They merely format it as a string. This allows you to do things like format a JavaScript regular expression as a Python string so your server-side Python script can feed a regex into client-side JavaScript code.

Jan Goyvaerts
Oh well, there is a pretty old version at school, just downloaded the new one at home and there is the toolbar :D thanks!
Andrea Ambu
My response applies to RegexBuddy 3.0.0 and later. Version 3.0.0 was released on 13 June 2007. That's the first version that can emulate different regex flavors (currently 15).
Jan Goyvaerts