I am trying to capture / extract numeric values from some strings.

Here is a sample string:

s='The shipping company had 93,999,888.5685 gallons of fuel on hand'

I want to pull out the 93,999,888.5685 value. I have gotten my regex this far:

> mine=re.compile("(\d{1,3}([,\d{3}])*[.\d+]*)")

However, when I do a findall I get the following:

mine.findall(s)

[('93,999,888.5685', '8')]

I have tried a number of different strategies to keep it from matching on the 8, but I am now realizing that I am not sure why it matched on the 8 in the first place.

Any illumination would be appreciated.

+1  A: 

Your string broken up:

(
\d{1,3}       This will match any group of 1-3 digits (`8`, `12`, `000`, etc)
  (
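     [,\d{3}] This is a character class, so it matches ONE character: a ",", a digit, or a literal "{", "3" or "}". It does not match a "," followed by 3 digits (that would be `,\d{3}` without the brackets)
  )*            **from zero to infinity times**; a repeated group keeps only its last repetition, which is the `8` here
  [.\d+]*     Also a character class: any number of periods ".", digits and literal "+" characters, from 0 to infinity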
)
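A quick way to check this breakdown is to run each piece by itself. The sketch below is only an illustration (the variable names are mine, not from the question). It suggests that `[,\d{3}]` behaves as a one-character class, while `,\d{3}` without brackets matches a comma plus three digits, which would explain why the second group ends up holding just `8`.

```python
import re

s = 'The shipping company had 93,999,888.5685 gallons of fuel on hand'

# Square brackets define a character class: one character from the set.
char_class = re.compile(r"[,\d{3}]")
# Without brackets, this is a comma followed by exactly three digits.
sequence = re.compile(r",\d{3}")

class_hits = char_class.findall(",999")    # one hit per character
sequence_hits = sequence.findall(",999")   # one hit for the whole chunk

# The original pattern: group 2 is repeated, so it keeps only its last repetition.
m = re.search(r"(\d{1,3}([,\d{3}])*[.\d+]*)", s)

print(class_hits)      # [',', '9', '9', '9']
print(sequence_hits)   # [',999']
print(m.group(1), m.group(2))
```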
Nick T
A: 

Why not wrap it in \D? mine=re.compile("\D(\d{1,3}([,\d{3}])[.\d+])\D")

chx
Did you test this? I get more garbage
PyNEwbie
+4  A: 

The 8 is being captured because you have 2 capturing groups, and findall returns every group. Mark the 2nd group as a non-capturing group using ?: as in this pattern: (\d{1,3}(?:[,\d{3}])*[.\d+]*)

Your second group, ([,\d{3}]), is responsible for the additional match.
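To see the difference side by side, here is a minimal sketch that runs both patterns against the sample string from the question. The names `capturing` and `non_capturing` are just labels I picked for the comparison.

```python
import re

s = 'The shipping company had 93,999,888.5685 gallons of fuel on hand'

# Two capturing groups: findall returns one tuple per match.
capturing = re.compile(r"(\d{1,3}([,\d{3}])*[.\d+]*)")
# Inner group marked (?:...): only the outer group is reported.
non_capturing = re.compile(r"(\d{1,3}(?:[,\d{3}])*[.\d+]*)")

with_inner = capturing.findall(s)
without_inner = non_capturing.findall(s)

print(with_inner)     # [('93,999,888.5685', '8')]
print(without_inner)  # ['93,999,888.5685']
```

With only one capturing group left, findall returns plain strings instead of tuples.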

Ahmad Mageed
Thank you very much. Now I need to figure out capturing versus non-capturing groups, but I will, and this answer will help me get there.
PyNEwbie
@PyNEwbie the idea is that your pattern must match overall, but anything within parentheses `()` is a capturing group. Sometimes you need groups within your pattern for structure or repetition but don't care to capture them, just match them. In those cases you can mark them as non-capturing. Another approach is to use named groups and extract the values from the named groups you're interested in. Refer to http://docs.python.org/library/re.html for more info.
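A rough sketch of the named-group approach, in case it helps. The group names `number`, `whole` and `fraction` are illustrative, and the pattern is my own assumption of what the number should look like (it uses `,\d{3}` and `\.\d+` rather than the character classes from the question).

```python
import re

s = 'The shipping company had 93,999,888.5685 gallons of fuel on hand'

# (?P<name>...) gives a group a name; (?:...) groups without capturing.
# The names and the pattern itself are assumptions for illustration.
pattern = re.compile(
    r"(?P<number>(?P<whole>\d{1,3}(?:,\d{3})*)(?P<fraction>\.\d+)?)"
)

m = pattern.search(s)
number = m.group("number")      # the full value
whole = m.group("whole")        # digits and thousands separators
fraction = m.group("fraction")  # decimal part, or None if absent

print(number, whole, fraction)
```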
Ahmad Mageed
A: 

When the pattern contains more than one group, findall returns a tuple for each match. The tuple contains each group (delineated by parentheses in the regex) of the match. You want the first group only. Below I've used a list comprehension to pull out the first group.

>>> import re
>>> # raw string so the backslashes reach the regex engine unchanged
>>> mine = re.compile(r"(\d{1,3}(,\d{3})*(\.?\d+)*)")
>>> s = 'blah 93,999,888.5685 blah blah blah 988,122.3.'
>>> [m[0] for m in mine.findall(s)]
['93,999,888.5685', '988,122.3']
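An alternative worth considering, sketched here with the same pattern and sample string: `finditer` yields match objects, and `group(0)` is always the whole match, so the result does not depend on how many groups the pattern has.

```python
import re

mine = re.compile(r"(\d{1,3}(,\d{3})*(\.?\d+)*)")
s = 'blah 93,999,888.5685 blah blah blah 988,122.3.'

# group(0) is the entire matched text, regardless of capturing groups.
numbers = [m.group(0) for m in mine.finditer(s)]

print(numbers)  # ['93,999,888.5685', '988,122.3']
```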
Steven Rumbalski