Hi, I'm sorry if this question is too simple, but I can't seem to reason this out. I want to parse a string and extract the text between 'The final score is: ' and '.'. In other words, if I have the string "The final score is: 25." I want to extract 25. I don't know how to do this. Should I use match or split?

Thanks

A: 

You don't need a regex to do this. Split on ":" and take the last element, or just use string slicing:

>>> s = "The final score is: 25."
>>> s[-3:-1]  # slicing like this only works for a two-digit score
'25'

>>> s.split(":")[-1].strip(" .")  # -1 always gets the last element; strip the space and the dot
'25'
ghostdog74
and if the final score is 125?
foosion
Then use the split method. There's nothing impossible about it.
ghostdog74
+3  A: 

If you know your information is always going to be formatted like that then you can simply use split:

s = "The final score is: 25"
score = s.split(':')[1].strip()

This leaves score as the string '25'. The .strip() at the end removes any surrounding whitespace as a safety measure.
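A slightly more defensive sketch of the same split idea, assuming the score is always an integer and the sentence may or may not end with a period. The helper name extract_score is made up for illustration:

```python
def extract_score(text):
    """Return the integer after the last colon (hypothetical helper)."""
    # rpartition splits on the last colon; tail is everything after it.
    _, sep, tail = text.rpartition(":")
    if not sep:
        raise ValueError("no colon found in %r" % text)
    # Drop surrounding whitespace, then a trailing period if there is one.
    return int(tail.strip().rstrip("."))


print(extract_score("The final score is: 25."))  # 25
print(extract_score("The final score is: 125"))  # 125
```

Converting with int() also means a malformed score fails loudly instead of passing through as an odd string.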

Bartek
In the given string there's a dot at the end, so it should be score = s.split(':')[1].strip()[:-1]
Maiku Mori
Good catch, I didn't notice the dot.
Bartek
+2  A: 

You're talking about "capturing" some characters between "The final score is:" and ".". This means you have a group. A group requires ()'s.

See http://docs.python.org/library/re.html for all the rules.

Since this smells like homework, I won't provide everything. The RE will have the form

matcher = r'something:(something)\.'

The ()'s define a group, which is saved in the match object and can be retrieved.

You have RE rules to match specific letters 'T', 'h', 'e', etc.

You have RE rules to match digits '\d'

S.Lott
+2  A: 

If you're trying to understand regular expressions and not just trying to get a value out of a string, you may find this helpful.

The first concept you need is that of grouping. Parentheses in a regular expression delimit a group; re.match() returns a match object whose groups() method returns a tuple of the text matched by the groups in the pattern. For instance:

>>> re.match('foo(bar)baz', 'foobarbaz').groups()
('bar',)
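As a small illustration of how the match object behaves, here is a sketch showing groups(), group(1), and what you get back when the pattern does not match at all:

```python
import re

m = re.match(r'foo(bar)baz', 'foobarbaz')
print(m.groups())  # ('bar',)  all captured groups, as a tuple
print(m.group(1))  # bar       just the first group
print(m.group(0))  # foobarbaz the whole match

# When nothing matches, re.match returns None rather than a match object,
# so calling .groups() on the result directly would raise AttributeError.
no_match = re.match(r'foo(bar)baz', 'something else')
print(no_match)    # None
```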

So in your case, you would create a pattern that matched text up to the colon, then a group that matched the text you're searching for. And here we get to the second part of the problem: what patterns should you search for? For instance, this pattern will definitely work:

The final score is: (25)\.

But it's not especially useful, since it will only return a match (and 25 in the first group) if the string you're matching starts with The final score is: 25. exactly. It won't match any other score. (The backslash is needed because an unescaped . matches any character, not just a period.)

When you're composing a regular expression, the question you ask yourself is: "What parts of the input string can change, and how?" That tells you what kind of patterns to write.

For instance, if your source always contains one and only one colon, the first part of your pattern can be [^:]*:. You're defining a class of characters that's everything other than a colon ([^:]), saying that you want to match it zero or more times (*), and then saying that you want to match the colon (:).

If you know that your source always ends with a period, you can formulate the pattern used for the group the same way: "match every character that's not a period", or [^.]*. And you'll end up with this:

>>> s = 'The final score is: 25.'
>>> re.match(r'[^:]*:([^.]*)', s).groups()
(' 25',)

This breaks if the value you're trying to capture contains a period, though. For a pattern that captures everything except the terminal period, you can define your group as (.*), which matches zero or more of any character, followed by \.$. The terminal \.$ means that in order for the pattern to match, it has to match a literal period at the end of the line. The group is greedy: it captures as many characters as it can, right up until the point that grabbing any more would cause the pattern to not match. Note that $ only means end-of-line outside a character class; inside square brackets it is a literal dollar sign, so [^\$]* would merely mean "anything but a dollar sign".

That means that this works:

>>> s = "The final score is: this.is.something.different."
>>> re.match(r'[^:]*:(.*)\.$', s).groups()
(' this.is.something.different',)
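To see the difference between the two kinds of group side by side, here is a minimal sketch. It uses (.*) with an escaped period for the "everything up to the final period" case, which is one reasonable way to write it rather than the only one:

```python
import re

s = "The final score is: this.is.something.different."

# [^.]* stops at the first period it meets.
first = re.match(r'[^:]*:([^.]*)', s).group(1)

# .* is greedy: it grabs everything, then backs off one character at a
# time until the trailing \.$ can match the final period.
greedy = re.match(r'[^:]*:(.*)\.$', s).group(1)

print(repr(first))   # ' this'
print(repr(greedy))  # ' this.is.something.different'
```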

Okay, now let's look at another possible approach. Suppose we don't know anything about the input except that there's going to be a colon, then somewhere after that a number, which may or may not be at the end of the string. In this case, our capturing group is clearly going to be ([\d]*), which grabs all of the digits it finds (the brackets are optional here; \d* means the same thing). But how do we formulate a pattern that correctly matches the widest possible range of inputs? Like this:

>>> s = '9. The answer is: 25 or thereabouts.'
>>> re.match(r'[^:]*[^\d]*([\d]*)', s).groups()
('25',)

Left to right, that pattern says: first, match everything that's not a colon. Then, once you hit the colon, match everything that's not a digit. Then grab all the digits.
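One thing to watch for: because every piece of that pattern uses *, it also "matches" a string with no digits at all and simply returns an empty group. A stricter sketch, assuming you want an integer back or None when there is no number, could use re.search with \d+ instead. The function name first_number_after_colon is made up for illustration:

```python
import re


def first_number_after_colon(text):
    """Return the first run of digits found after a colon, or None."""
    # ':' must be present, \D* skips any non-digits,
    # and (\d+) requires at least one digit.
    m = re.search(r':\D*(\d+)', text)
    return int(m.group(1)) if m else None


print(first_number_after_colon('9. The answer is: 25 or thereabouts.'))  # 25
print(first_number_after_colon('The final score is: 125.'))              # 125
print(first_number_after_colon('No number here: sorry.'))                # None

# The looser pattern from above succeeds on the same input with an empty group.
print(re.match(r'[^:]*[^\d]*([\d]*)', 'No number here: sorry.').groups())  # ('',)
```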

I hope that helps. I'm still trying to learn regular expressions myself, which is why I'm bothering to write an answer as detailed as this.

Robert Rossney