I am new to Python and pyparsing. I need to accomplish the following.

My sample line of text is like this:

12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009

I need to extract the item description and the period from each line. So far I have:

tok_date_in_ddmmmyyyy = Combine(Word(nums,min=1,max=2)+ " " + Word(alphas, exact=3) + " " + Word(nums,exact=4))
tok_period = Combine((tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy)|tok_date_in_ddmmmyyyy)

tok_desc =  Word(alphanums+"-()") but stop before tok_period

How can I do this?

+3  A: 

M K Saravanan, this particular parsing problem is not so hard to do with good ol' re:

import re

text = '''
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009
This line does not match
'''

# One date, optionally followed by "to" and a second date
date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')
for line in text.splitlines():
    if line:
        try:
            # Keep the first two pieces and strip the whitespace around them
            description, period = map(str.strip, date_pat.split(line)[:2])
            print((description, period))
        except ValueError:
            # The line does not match
            pass

yields

# ('12 items - Ironing Service', '11 Mar 2009 to 10 Apr 2009')
# ('Washing service (3 Shirt)', '23 Mar 2009')

The main workhorse here is of course the re pattern. Let's break it apart:

\d{1,2}\s+[a-zA-Z]{3}\s+\d{4} is the regexp for a date, the equivalent of tok_date_in_ddmmmyyyy. \d{1,2} matches one or two digits, \s+ matches one or more whitespace characters, [a-zA-Z]{3} matches exactly three letters, and \d{4} matches four digits.

(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})? is a regexp wrapped in (?:...), which marks a non-capturing group: no group number (e.g. match.group(2)) is assigned to it. This matters because date_pat.split() includes the text of every capturing group in the list it returns. With a capturing group here, the list would contain an extra element, ' to 10 Apr 2009', after the full period; with the non-capturing form, the period 11 Mar 2009 to 10 Apr 2009 appears exactly once, as a single element. The question mark at the end means the group may occur zero or one time, so the regexp matches both 23 Mar 2009 and 11 Mar 2009 to 10 Apr 2009.
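To make the effect of the non-capturing group concrete, here is a small sketch that splits the same line with both variants of the pattern. The names DATE, non_capturing and capturing are made up for this illustration and do not appear in the code above.

```python
import re

# Hypothetical helper name: the single-date sub-pattern from the answer
DATE = r'\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}'
line = '12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009'

# Inner group written as (?:...), as in the answer
non_capturing = re.compile(r'(' + DATE + r'(?:\s+to\s+' + DATE + r')?)')
# Same pattern, but the inner group captures
capturing = re.compile(r'(' + DATE + r'(\s+to\s+' + DATE + r')?)')

parts_non_capturing = non_capturing.split(line)
parts_capturing = capturing.split(line)

print(parts_non_capturing)
# ['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']
print(parts_capturing)
# ['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', ' to 10 Apr 2009', '']
```

The capturing variant adds one list element per extra group, so any code that relies on the position of the pieces has to account for it.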

text.splitlines() splits text into a list of lines.

date_pat.split('12 items - Ironing Service 11 Mar 2009 to 10 Apr 2009')

splits the string on the date_pat regexp. The match is included in the returned list. Thus we get:

['12 items - Ironing Service ', '11 Mar 2009 to 10 Apr 2009', '']

Mapping strip over date_pat.split(line)[:2] keeps the first two pieces (description and period) and removes the whitespace around them.

If line does not match date_pat, then date_pat.split(line) returns [line], so

description, period = map(str.strip, date_pat.split(line)[:2])

raises a ValueError because we can't unpack a list with only one element into a 2-tuple. We catch this exception but simply pass on to the next line.
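If you would rather not rely on the ValueError, one possible alternative is to call search() and slice the line at the start of the match. This is only a sketch: split_description_period is a hypothetical helper name, and it returns None for lines without a date instead of raising.

```python
import re

date_pat = re.compile(
    r'(\d{1,2}\s+[a-zA-Z]{3}\s+\d{4}(?:\s+to\s+\d{1,2}\s+[a-zA-Z]{3}\s+\d{4})?)')

def split_description_period(line):
    """Return (description, period), or None if the line contains no date."""
    match = date_pat.search(line)
    if match is None:
        return None
    # Everything before the match is the description
    return line[:match.start()].strip(), match.group(1).strip()

for line in ['12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009',
             'Washing service (3 Shirt)  23 Mar 2009',
             'This line does not match']:
    print(split_description_period(line))
```

The behaviour for matching lines is the same as with split(); the difference is only in how a non-matching line is reported.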

unutbu
Thanks a lot for the detailed explanation.
M K Saravanan
+2  A: 

I would suggest looking at SkipTo as the pyparsing class that is most appropriate, since you have a good definition of the unwanted text, but will accept pretty much anything before that. Here are a couple of ways to use SkipTo:

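from pyparsing import Combine, SkipTo, Word, alphas, nums

text = """\
12 items - Ironing Service    11 Mar 2009 to 10 Apr 2009
Washing service (3 Shirt)  23 Mar 2009"""

# tok_period as defined in the OP
tok_date_in_ddmmmyyyy = Combine(Word(nums,min=1,max=2)+ " " + Word(alphas, exact=3) + " " + Word(nums,exact=4))
tok_period = Combine((tok_date_in_ddmmmyyyy + " to " + tok_date_in_ddmmmyyyy)|tok_date_in_ddmmmyyyy)

# parse each line separately
for tx in text.splitlines():
    print(SkipTo(tok_period).parseString(tx)[0])

# or have pyparsing search through the whole input string using searchString
# (assumes a pyparsing version where each match is a flat [description, period] list)
for tokens in SkipTo(tok_period,include=True).searchString(text):
    print(tokens[0])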

Both for loops print the following:

12 items - Ironing Service    
Washing service (3 Shirt)
Paul McGuire
Thanks Paul for pointing me to the right class and for the code snippet.
M K Saravanan