M K Saravanan, this particular parsing problem is not so hard to do with good ol' re:
import re

text = '''
12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt) 23 Mar 2009
This line does not match
'''
date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')
for line in text.splitlines():
    if line:
        try:
            # str.strip works on Python 2 and 3; string.strip exists only in Python 2
            description, period = map(str.strip, date_pat.split(line)[:2])
            print((description, period))
        except ValueError:
            # The line does not match
            pass
yields
# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
# ('Washing service (3 Shirt)', '23 Mar 2009')
The main workhorse here is of course the re pattern. Let's break it apart:
\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}
is the regexp for a date, the equivalent of tok_date_in_ddmmmyyyy. \d{1,2} matches one or two digits, \s+ matches one or more whitespace characters, [a-zA-Z]{3} matches exactly three letters, and \d{4} matches four digits.
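As a quick check of the date sub-pattern on its own, here is a minimal sketch. The sample strings and the names single_date, samples and matches are made up for illustration. Note that the pattern only checks the shape of a date, not whether the month name is real.

```python
import re

# The date sub-pattern from the answer, compiled on its own.
single_date = re.compile(r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}')

# Hypothetical sample strings, chosen only for illustration.
samples = ['23 Mar 2009', '3 Shirt', '12 items', '1 Jan 2010']
matches = [bool(single_date.search(s)) for s in samples]
print(matches)  # [True, False, False, True]

# The pattern checks shape only, so a nonsense "date" also matches.
shape_only = bool(single_date.search('99 Xyz 0000'))
print(shape_only)  # True
```

'3 Shirt' and '12 items' fail because the three letters must be followed by whitespace, which is why the counts in the descriptions are not mistaken for dates.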
(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?
is a regexp surrounded by (?:...). This marks a non-capturing group: no group number (e.g. match.group(2)) is assigned to it. This matters because date_pat.split() includes the text of every capturing group in the list it returns. By suppressing the capture, we keep the entire period 11 Mar 2009 to 10 Apr 2009 together as a single list item. The question mark at the end makes the group optional (it may occur zero or one time), which allows the regexp to match both 23 Mar 2009 and 11 Mar 2009 to 10 Apr 2009.
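To see what the non-capturing group buys us, here is a small sketch comparing the pattern above with a variant whose optional tail is a plain capturing group. The variant and the names date, non_capturing, capturing, parts_nc and parts_c are only for illustration.

```python
import re

line = '12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009'
date = r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}'

# Optional tail as a non-capturing group, as in the pattern above.
non_capturing = re.compile(r'(' + date + r'(?:\s+to\s+' + date + r')?)')
# Hypothetical variant: the optional tail is an ordinary capturing group.
capturing = re.compile(r'(' + date + r'(\s+to\s+' + date + r')?)')

parts_nc = non_capturing.split(line)
parts_c = capturing.split(line)
print(parts_nc)
# ['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']
print(parts_c)
# ['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', ' to 10 Apr 2009', '']
```

With the capturing variant, split() adds an extra item for the inner group, so taking the first two items would still work here, but the list no longer has a predictable shape.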
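text.splitlines() splits text into a list of lines at line boundaries such as \n. Because text starts with a newline, the first item is an empty string, which the if line: test skips.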
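A short sketch of how splitlines() behaves, using made-up strings (text, lines and mixed here are illustrative names, not the data from the question):

```python
# Same shape as the text in the answer: it starts and ends with a newline.
text = '\nfirst line\nsecond line\n'
lines = text.splitlines()
print(lines)  # ['', 'first line', 'second line']

# splitlines() also understands other line endings; split('\n') does not.
mixed = 'a\r\nb\nc'
print(mixed.splitlines())  # ['a', 'b', 'c']
print(mixed.split('\n'))   # ['a\r', 'b', 'c']
```

The leading empty string is the reason for the if line: guard in the loop.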
date_pat.split('12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009')
splits the string on the date_pat regexp. Because the whole pattern is wrapped in a capturing group, the matched text is included in the returned list.
Thus we get:
['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']
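map(str.strip, date_pat.split(line)[:2])
keeps the first two items (the description and the period) and strips their surrounding whitespace. (str.strip does the same job as the Python 2-only string.strip and also works on Python 3.)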
If line does not match date_pat, then date_pat.split(line) returns [line], so unpacking the result into description, period raises a ValueError because we can't unpack a list with only one element into a 2-tuple. We catch this exception and simply move on to the next line.
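If you would rather not rely on the exception, one possible alternative (not what the code above does) is to use date_pat.search() and check for None. This is only a sketch; parse_line is a hypothetical helper name.

```python
import re

date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')

def parse_line(line):
    """Return (description, period), or None if the line contains no date."""
    m = date_pat.search(line)
    if m is None:
        return None
    # Everything before the match is the description; group(1) is the period.
    return line[:m.start()].strip(), m.group(1)

results = [parse_line(s) for s in (
    '12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009',
    'Washing service (3 Shirt) 23 Mar 2009',
    'This line does not match',
)]
print(results)
# [('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009'),
#  ('Washing service (3 Shirt)', '23 Mar 2009'), None]
```

Both approaches give the same tuples for matching lines; this one just makes the "no match" case explicit instead of signalling it through a failed unpacking.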