This code works:

from pyparsing import *

zipRE = r"\d{5}(?:[-\s]\d{4})?"
fooRE = r"^\!\s+.*"

zipcode = Regex( zipRE )
foo = Regex( fooRE )

query = ( zipcode | foo )



tests = [ "80517", "C6H5OH", "90001-3234", "! sfs" ]

for t in tests:
    try:
        results = query.parseString( t )
        print(t, "->", results)
    except ParseException as pe:
        print(pe)

I'm stuck on two issues:

1 - How do I use a custom function to parse a token? For instance, suppose I wanted to use some custom logic instead of a regex to determine if a number is a zipcode. Instead of:

zipcode = Regex( zipRE )

perhaps:

zipcode = MyFunc()

2 - How do I determine what a string parses TO? "80001" parses to "zipcode", but how do I determine this using pyparsing? I'm not parsing a string for its contents, only to determine what kind of query it is.

+3  A: 

You could use zipcode and foo separately, so that you know which one the string matches.

zipresults = zipcode.parseString( t )
fooresults = foo.parseString( t )
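
Note that parseString raises a ParseException when the expression does not match, so each attempt needs its own try/except. A minimal sketch of that idea follows; the classify helper and the candidates list are hypothetical names introduced for illustration and are not part of pyparsing:

```python
from pyparsing import Regex, ParseException

zipcode = Regex(r"\d{5}(?:[-\s]\d{4})?")
foo = Regex(r"^\!\s+.*")

# Hypothetical helper (not part of pyparsing): try each labelled
# expression in order and return the label of the first one that matches.
def classify(text, candidates):
    for label, expr in candidates:
        try:
            expr.parseString(text)
        except ParseException:
            continue  # this expression did not match, try the next one
        return label
    return None  # nothing matched

candidates = [("zipcode", zipcode), ("foo", foo)]

for t in ["80517", "C6H5OH", "90001-3234", "! sfs"]:
    print(t, "->", classify(t, candidates))
```

The order of the candidates list decides which label wins when more than one expression could match.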
pwdyson
+2  A: 

I do not have the pyparsing module, but Regex must be a class, not a function.

What you can do is subclass from it and override methods as required to customize behaviour, then use your subclasses instead.

badp
+2  A: 

Your second question is easy, so I'll answer that first. Change query to assign results names to the different expressions:

query = ( zipcode("zip") | foo("foo") ) 

Now you can call getName() on the returned result:

print(t, "->", results, results.getName())

Giving:

80517 -> ['80517'] zip
Expected Re:('\\d{5}(?:[-\\s]\\d{4})?') (at char 0), (line:1, col:1)
90001-3234 -> ['90001-3234'] zip
! sfs -> ['! sfs'] foo
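
If the name is only needed to decide what to do next, one option is to look it up in a table of handlers. This is a rough sketch, not the only way to do it; handle_zip, handle_foo, handlers and dispatch are all hypothetical names made up for the example:

```python
from pyparsing import Regex, ParseException

zipcode = Regex(r"\d{5}(?:[-\s]\d{4})?")
foo = Regex(r"^\!\s+.*")
query = zipcode("zip") | foo("foo")

# Hypothetical handlers, one per results name.
def handle_zip(tokens):
    return "zipcode lookup for " + tokens[0]

def handle_foo(tokens):
    return "foo command: " + tokens[0]

handlers = {"zip": handle_zip, "foo": handle_foo}

# Hypothetical dispatcher: parse, then route on the results name.
def dispatch(text):
    try:
        results = query.parseString(text)
    except ParseException:
        return "unrecognized query"
    return handlers[results.getName()](results)

for t in ["80517", "C6H5OH", "90001-3234", "! sfs"]:
    print(t, "->", dispatch(t))
```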

If you are going to use the result's fooness or zipness to call another function, then you could do this at parse time by attaching a parse action to your foo and zipcode expressions:

# enclose zipcodes in '*'s, foos in '#'s
zipcode.setParseAction(lambda t: '*' + t[0] + '*')
foo.setParseAction(lambda t: '#' + t[0] + '#')

query = ( zipcode("zip") | foo("foo") ) 

Now gives:

80517 -> ['*80517*'] zip
Expected Re:('\\d{5}(?:[-\\s]\\d{4})?') (at char 0), (line:1, col:1)
90001-3234 -> ['*90001-3234*'] zip
! sfs -> ['#! sfs#'] foo

For your first question, I don't know exactly what kind of function you mean. Pyparsing provides many more parsing classes than just Regex (such as Word, Keyword, Literal, CaselessLiteral), and you compose your parser by combining them with the '+', '|', '^', '~', '&' and '*' operators. For instance, if you wanted to parse a US Social Security number without using a Regex, you could use:

ssn = Combine(Word(nums,exact=3) + '-' + 
        Word(nums,exact=2) + '-' + Word(nums,exact=4))

Word matches contiguous "words" made up of the characters given to its constructor; Combine concatenates the matched tokens into a single token.
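
As a quick check of how this behaves, here is a small runnable sketch. The sample numbers are made up. Because Combine requires its pieces to be adjacent, the same digits separated by spaces are rejected:

```python
from pyparsing import Combine, Word, nums, ParseException

ssn = Combine(Word(nums, exact=3) + '-' +
              Word(nums, exact=2) + '-' + Word(nums, exact=4))

# The three Words and two dashes come back as one combined token.
print(ssn.parseString("123-45-6789").asList())  # ['123-45-6789']

# Combine does not skip whitespace between its pieces, so this fails.
try:
    ssn.parseString("123 - 45 - 6789")
except ParseException as pe:
    print("no match:", pe)
```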

If you wanted to parse for a potential list of such numbers, delimited by '/'s, use:

delimitedList(ssn, '/')

Or, if there were between 1 and 3 such numbers, with no delimiters, use:

ssn * (1,3)
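
Both forms can be tried on sample input like this. It is only a sketch with invented numbers; ssn_list and one_to_three are names chosen for the example:

```python
from pyparsing import Combine, Word, delimitedList, nums

ssn = Combine(Word(nums, exact=3) + '-' +
              Word(nums, exact=2) + '-' + Word(nums, exact=4))

# '/'-delimited list: the delimiters are dropped from the results.
ssn_list = delimitedList(ssn, '/')
print(ssn_list.parseString("111-22-3333/444-55-6666").asList())
# ['111-22-3333', '444-55-6666']

# Between 1 and 3 numbers, separated only by whitespace.
one_to_three = ssn * (1, 3)
print(one_to_three.parseString("111-22-3333 444-55-6666").asList())
# ['111-22-3333', '444-55-6666']
```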

Any expression can have results names or parse actions attached to it, to further enrich the parsed results or to add functionality during parsing. You can even build recursive parsers, such as nested lists of parentheses, arithmetic expressions, etc., using the Forward class.
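
To give an idea of what a recursive parser looks like, here is a minimal sketch for nested parenthesized lists. The grammar (alphanumeric words and parentheses only) and the names nested and item are assumptions picked for the example:

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphanums

LPAR, RPAR = Suppress("("), Suppress(")")

# Forward is a placeholder so the grammar can refer to itself
# before it is fully defined.
nested = Forward()
item = Word(alphanums) | Group(LPAR + nested + RPAR)
nested <<= ZeroOrMore(item)

# Each parenthesized group becomes a nested list in the results.
print(nested.parseString("a (b c (d)) e").asList())
# ['a', ['b', 'c', ['d']], 'e']
```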

My intent when I wrote pyparsing was that this composition of parsers from basic building blocks would be the primary way to create a parser. It was only in a later release that I added Regex as (what I thought was) the ultimate escape valve: if people couldn't build up their parser, they could fall back on regex's format, which has definitely proven its power over time.

Or, as one other poster suggests, you can open up the pyparsing source, and subclass one of the existing classes, or write your own, following their structure. Here is a class that would match for paired characters:

class PairOf(Token):
    """Token for matching words composed of a pair
       of characters in a given set.
    """
    def __init__( self, chars ):
        super(PairOf,self).__init__()
        self.pair_chars = set(chars)

    def parseImpl( self, instring, loc, doActions=True ):
        if (loc < len(instring)-1 and 
           instring[loc] in self.pair_chars and
           instring[loc+1] == instring[loc]):
            return loc+2, instring[loc:loc+2]
        else:
            raise ParseException(instring, loc, "Not at a pair of characters")

So that:

punc = r"~!@#$%^&*_-+=|\?/"
parser = OneOrMore(Word(alphas) | PairOf(punc))
print(parser.parseString("Does ** this match @@@@ %% the parser?"))

Gives:

['Does', '**', 'this', 'match', '@@', '@@', '%%', 'the', 'parser']

(Note the omission of the trailing single '?')

Paul McGuire
Thank you for such a thoughtful response. I literally dreamed about this last night. I think I have an example to illustrate what I am trying to do. One of the pyparsing examples parses a street address. What if I wanted to extend this code to parse for a city and state? I could parse for the state using stateAbbreviation = oneOf("""AA AE AK AL... (as you provided). But what if I want to parse for a city by calling a function in another piece of code, perhaps isStringPlacename(), or maybe a web service that will tell me if a given string is a placename...
Art
Some more info: I'm writing a parser to categorise search queries. I hope to determine if a query is an address, zipcode, tracking number, mathematical expression, etc...
Art
A parse action can be good for adding further validation beyond simple expression matching. If the further logic fails, have the parse action raise a ParseException, and the parser will treat this like a failure to match the original expression.
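
A rough sketch of that approach, applied to the placename case: is_placename and KNOWN_PLACES are hypothetical stand-ins for the external function or web service, validate_city is a made-up name, and matching a city as a single Word(alphas) is a simplifying assumption:

```python
from pyparsing import Regex, Word, alphas, ParseException

# Hypothetical stand-in for an external placename check
# (another module, a web service, etc.).
KNOWN_PLACES = {"boulder", "denver", "austin"}

def is_placename(s):
    return s.lower() in KNOWN_PLACES

# Parse action: reject the match unless the custom check passes.
def validate_city(s, loc, tokens):
    if not is_placename(tokens[0]):
        raise ParseException(s, loc, "not a known placename")

zipcode = Regex(r"\d{5}(?:[-\s]\d{4})?")
city = Word(alphas).setParseAction(validate_city)
query = zipcode("zip") | city("city")

for t in ["80517", "Denver", "Xyzzy"]:
    try:
        print(t, "->", query.parseString(t).getName())
    except ParseException as pe:
        print(t, "-> no match:", pe)
```

Since the exception is raised inside the parse action, "Xyzzy" is treated as if Word(alphas) had never matched, and the parser moves on to any remaining alternatives.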
Paul McGuire