views:

113

answers:

2

I have strings like this:

"MSE 2110, 3030, 4102"

I would like to output:

[("MSE", 2110), ("MSE", 3030), ("MSE", 4102)]

This is my way of going about it, although I haven't quite gotten it yet:

def makeCourseList(str, location, tokens):
    print "before: %s" % tokens

    for index, course_number in enumerate(tokens[1:]):
        tokens[index + 1] = (tokens[0][0], course_number)

    print "after: %s" % tokens

course = Group(DEPT_CODE + COURSE_NUMBER) # .setResultsName("Course")

course_data = (course + ZeroOrMore(Suppress(',') + COURSE_NUMBER)).setParseAction(makeCourseList)

This outputs:

>>> course.parseString("CS 2110")
([(['CS', 2110], {})], {})
>>> course_data.parseString("CS 2110, 4301, 2123, 1110")
before: [['CS', 2110], 4301, 2123, 1110]
after: [['CS', 2110], ('CS', 4301), ('CS', 2123), ('CS', 1110)]
([(['CS', 2110], {}), ('CS', 4301), ('CS', 2123), ('CS', 1110)], {})

Is this the right way to do it, or am I totally off?

Also, the output of isn't quite correct - I want course_data to emit a list of course symbols that are in the same format as each other. Right now, the first course is different from the others. (It has a {}, whereas the others don't.)

+1  A: 

Is this the right way to do it, or am I totally off?

It's one way to do it, though of course there are others (e.g. use as parse actions two bound method -- so the instance the method belongs to can keep state -- one for the dept code and another for the course number).

The return value of the parseString call is harder to bend to your will (though I'm sure sufficiently dark magic will do it and I look forward to Paul McGuire explaining how;-), so why not go the bound-method route as in...:

from pyparsing import *

DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")
COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("CourseNumber")

class MyParse(object):
  def __init__(self):
      self.result = None

  def makeCourseList(self, str, location, tokens):
      print "before: %s" % tokens

      dept = tokens[0][0]
      newtokens = [(dept, tokens[0][1])]
      newtokens.extend((dept, tok) for tok in tokens[1:])

      print "after: %s" % newtokens
      self.result = newtokens

course = Group(DEPT_CODE + COURSE_NUMBER).setResultsName("Course")

inst = MyParse()
course_data = (course + ZeroOrMore(Suppress(',') + COURSE_NUMBER)
    ).setParseAction(inst.makeCourseList)
ignore = course_data.parseString("CS 2110, 4301, 2123, 1110")
print inst.result

this emits:

before: [['CS', '2110'], '4301', '2123', '1110']
after: [('CS', '2110'), ('CS', '4301'), ('CS', '2123'), ('CS', '1110')]
[('CS', '2110'), ('CS', '4301'), ('CS', '2123'), ('CS', '1110')]

which seems to be what you require, if I read your specs correctly.

Alex Martelli
+3  A: 

This solution memorizes the department when parsed, and emits a (dept,coursenum) tuple when a number is found.

from pyparsing import Suppress,Word,ZeroOrMore,alphas,nums,delimitedList

data = '''\
MSE 2110, 3030, 4102
CSE 1000, 2000, 3000
'''

def memorize(t):
    memorize.dept = t[0]

def token(t):
    return (memorize.dept,int(t[0]))

course = Suppress(Word(alphas).setParseAction(memorize))
number = Word(nums).setParseAction(token)
line = course + delimitedList(number)
lines = ZeroOrMore(line)

print lines.parseString(data)

Output:

[('MSE', 2110), ('MSE', 3030), ('MSE', 4102), ('CSE', 1000), ('CSE', 2000), ('CSE', 3000)]
Mark Tolonen
+1 that is clever.
Rosarch
Thanks. Don't forget to accept an answer ;^)
Mark Tolonen
Extremely clever and elegant. Love it!
jathanism