This is Part 2 of this question; thanks very much for David's answer. What if I need to extract only the dates that are bounded by two keywords?

Example:

text = "One 09 Jun 2011 Two 10 Dec 2012 Three 15 Jan 2015 End"

Case 1 bounding keywords: "One" and "Three"
Result expected: ['09 Jun 2011', '10 Dec 2012']

Case 2 bounding keywords: "Two" and "End"
Result expected: ['10 Dec 2012', '15 Jan 2015']

Thanks!

A: 

Do you really need to worry about the keywords? Can you ensure that the keywords will not change?

If the keywords don't actually matter, the exact same solution from the previous question solves this:

>>> import re
>>> text = "One 09 Jun 2011 Two 10 Dec 2012 Three 15 Jan 2015 End"
>>> match = re.findall(r'\d\d\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4}', text)
>>> match
['09 Jun 2011', '10 Dec 2012', '15 Jan 2015']

If you really only need two of the dates, you could just use list slicing:

>>> match[:2]
['09 Jun 2011', '10 Dec 2012']
>>> match[1:]
['10 Dec 2012', '15 Jan 2015']
jathanism
The keywords (user-defined) are important for excluding some dates which are not inside the relevant part of a document.
ohho
So the keywords will be different and will have variable lengths? You'll have to use greedy matching. Alpha only, or alphanumeric? These are all important considerations when building your patterns.
jathanism
Please consider the bounding keywords to be 2 constant strings.
ohho
+2  A: 

You can do this with two regular expressions. One regex gets the text between the two keywords. The other regex extracts the dates.

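import re
# First regex: lazily capture the text between the two whole-word keywords.
match = re.search(r"\bOne\b(.*?)\bThree\b", text, re.DOTALL)
if match:
    betweenwords = match.group(1)
    # Second regex: extract the dates from the captured text only.
    dates = re.findall(r'\d\d (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', betweenwords)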
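
Since the keywords are user-defined, the two-step approach could be wrapped in a small helper. This is only a sketch: the name `dates_between` is hypothetical, and it assumes both keywords begin and end with word characters (otherwise the `\b` anchors would not behave as intended). `re.escape` keeps any regex metacharacters in the keywords literal.

```python
import re

# Same date pattern as above, compiled once for reuse.
DATE_RE = re.compile(r"\d\d (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}")

def dates_between(text, start, end):
    """Return the dates found between the first `start` keyword and the next `end` keyword."""
    # Build the bounding regex from the (escaped) user-supplied keywords.
    bounds = re.compile(
        r"\b%s\b(.*?)\b%s\b" % (re.escape(start), re.escape(end)),
        re.DOTALL,
    )
    match = bounds.search(text)
    if not match:
        # One or both keywords are missing, so there is nothing to extract.
        return []
    return DATE_RE.findall(match.group(1))

text = "One 09 Jun 2011 Two 10 Dec 2012 Three 15 Jan 2015 End"
print(dates_between(text, "One", "Three"))  # ['09 Jun 2011', '10 Dec 2012']
print(dates_between(text, "Two", "End"))    # ['10 Dec 2012', '15 Jan 2015']
```

Returning an empty list when a keyword is missing avoids leaving `dates` undefined, which is what happens in the snippet above when `re.search` finds no match.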
Jan Goyvaerts
It works, thx! Except re.findall(..., text) should be re.findall(..., betweenwords). Btw, is the first and last "\b" required in the first regex?
ohho
I have corrected the `findall` parameter. All 4 `\b` are required if you want your words to be matched as whole words. E.g. `\bEnd\b` cannot match `Ending`. If you don't care whether your two keywords are whole or partial words, then you can omit all 4 `\b`.
Jan Goyvaerts
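
The effect of `\b` described in the last comment can be checked with a short sketch. The sample string here is made up for illustration; it contains both `Ending` and the whole word `End`.

```python
import re

# Hypothetical sample containing a partial match ("Ending") before the whole word "End".
sample = "Ending 01 Jan 2000 End"

# With word boundaries, "End" inside "Ending" is skipped and the final word is found.
whole = re.search(r"\bEnd\b", sample).start()
# Without them, the pattern matches the first three letters of "Ending".
partial = re.search(r"End", sample).start()

print(whole)    # 19
print(partial)  # 0
```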