ansaurus

Question

Using pyparsing to parse a word escape-split over multiple lines

Answer 1

+3 A:

After poking around a bit more, I came upon this help thread, which contained this notable bit:

I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of "one or more" or "zero or more" or "optional"...

With that, I got the idea to change these two lines

multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))

To

multi_line_word = ZeroOrMore(split_word) + word

This got it to output what I was looking for: ['super', 'cali', 'fragi', 'listic'].

Next, I added a parse action that would join these tokens together:

multi_line_word.setParseAction(lambda t: ''.join(t))

This gives a final output of ['supercalifragilistic'].
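A minimal, runnable sketch of the two steps above. The definitions of word, continued_ending and split_word are assumptions (the question's originals are not reproduced in this answer), and the camelCase names follow the thread; newer pyparsing releases prefer the snake_case equivalents.

```python
from pyparsing import Literal, Suppress, Word, ZeroOrMore, alphas, lineEnd

# Assumed definitions; the question's originals are not shown in this answer.
word = Word(alphas)
continued_ending = Literal('\\') + lineEnd      # backslash, then end of line
split_word = word + Suppress(continued_ending)  # fragment followed by '\'-newline

text = 'super\\\ncali\\\nfragi\\\nlistic'

# Iterative form: any number of split fragments, then the final fragment.
fragments = ZeroOrMore(split_word) + word
fragment_tokens = fragments.parseString(text).asList()

# Same grammar, with a parse action that joins the fragments into one token.
joined = (ZeroOrMore(split_word) + word).setParseAction(lambda t: ''.join(t))
joined_tokens = joined.parseString(text).asList()

print(fragment_tokens)  # ['super', 'cali', 'fragi', 'listic']
print(joined_tokens)    # ['supercalifragilistic']
```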

The take home message I learned is that one doesn't simply walk into Mordor.

Just kidding.

The take-home message is that a one-to-one translation of BNF into pyparsing doesn't always work. Where BNF expresses repetition through recursion, pyparsing's repetition classes (ZeroOrMore, OneOrMore) should be used instead.

EDIT 2009-11-25: To handle stricter test cases, I modified the code to the following:

from pyparsing import Literal, NotAny, OneOrMore, Optional, Suppress, White, Word, alphas, lineEnd
no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))

This has the benefit of making sure that no space comes between any of the elements (with the exception of newlines after the escaping backslashes).
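One way to check that claim is to run the grammar against a few inputs. The sketch below repeats the grammar so it runs on its own; the try_parse helper and the test strings are illustrative assumptions, not part of the original answer.

```python
from pyparsing import (Literal, NotAny, OneOrMore, Optional, ParseException,
                       Suppress, White, Word, alphas, lineEnd)

no_space = NotAny(White(' \t\r'))
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))


def try_parse(text):
    """Hypothetical helper: return the token list, or None on ParseException."""
    try:
        return multi_line_word.parseString(text).asList()
    except ParseException:
        return None


good = try_parse('super\\\ncali\\\nfragi\\\nlistic')  # contiguous fragments
space_after_eol = try_parse('sh\\\n iny')              # space after the newline
space_before_slash = try_parse('sh \\\niny')           # space before the backslash
space_after_slash = try_parse('sh\\ \niny')            # space between '\' and EOL

print(good, space_after_eol, space_before_slash, space_after_slash)
```

Only the first input is accepted; the other three are rejected with a ParseException.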

gotgenes 2009-11-15 04:10:18

Using `Combine` also enforces no intervening whitespace.

Paul McGuire 2009-11-16 06:24:46

Interesting. I tried `multi_line_word = Combine(Combine(OneOrMore(split_word)) + Optional(word))`, but it breaks on the `'sh\\\n iny'` case: it doesn't raise an exception, but instead returns `['sh']`. Am I missing something?

gotgenes 2009-11-16 20:04:49

Well, your word is not just letters spanning a '\'-newline, but there is that space in there before the letter 'i', which counts as a word break, so Combine stops after the 'sh'. You *can* modify Combine with an adjacent=False constructor argument, but beware - you might end up sucking up the entire file as a single word! Or you can redefine your definition of continued_ending to include any whitespace after the lineEnd, if you want to also collapse any leading spaces.
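A rough sketch of those two options, under assumed definitions of word and continued_ending (the names collapsing_word and loose_word are made up for illustration). Both variants deliberately accept the indented continuation, so they only fit if collapsing leading spaces is actually wanted.

```python
from pyparsing import (Combine, Literal, Optional, Suppress, White, Word,
                       ZeroOrMore, alphas, lineEnd)

word = Word(alphas)

# Option 1: let the continuation also swallow indentation on the next line.
continued_ending = Literal('\\') + lineEnd + Optional(White(' \t'))
collapsing_word = Combine(word + ZeroOrMore(Suppress(continued_ending) + word))

# Option 2: adjacent=False lets Combine skip whitespace between its tokens.
loose_word = Combine(
    word + ZeroOrMore(Suppress(Literal('\\') + lineEnd) + word),
    adjacent=False,
)

collapsed = collapsing_word.parseString('sh\\\n iny').asList()
loose = loose_word.parseString('sh\\\n iny').asList()

print(collapsed)  # ['shiny']
print(loose)      # ['shiny']
```

With adjacent=False, whitespace is skipped before every token, not just after the newline, which is why it can absorb more than intended.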

Paul McGuire 2009-11-17 01:56:25

I would prefer `multi_line_word.parseString('sh\\\n iny')` raise `ParseException`, not identify `'sh'` as its token. In this case `'sh'` and `'iny'` are two words, not parts of a broken word, because the `'iny'` part is not contiguous with the EOL. Thus, `multi_line_word` shouldn't recognize it. It should throw up its hands and say, "This is not a valid broken-up word!"

gotgenes 2009-11-17 16:34:20

Answer 2

A:

Is this an exercise in pyparsing? If not, then don't bother with pyparsing. Why not just do something like this?

text.replace("\\\n", "")
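For illustration, with an assumed sample string: str.replace returns a new string rather than modifying text in place, so the result has to be captured. It also removes every backslash-newline pair without checking what surrounds it.

```python
text = 'super\\\ncali\\\nfragi\\\nlistic'

# str.replace returns a new string; the original is left unchanged.
joined = text.replace('\\\n', '')
print(joined)  # supercalifragilistic

# No validation happens: a space after the newline simply stays in the result.
loose = 'sh\\\n iny'.replace('\\\n', '')
print(repr(loose))  # 'sh iny'
```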

Chris Lacasse 2009-11-15 04:12:31

This is absolutely an exercise in pyparsing; part of a larger parser in which matching these cases is necessary. That's why "Using pyparsing" is in the title.

gotgenes 2009-11-15 05:36:14

Answer 3

+2 A:

You are pretty close with your code. Any of these mods would work:

# '|' means MatchFirst, so 'word' always matched first and the recursive alternative was never tried
# reversing the order of the alternatives makes this work
multi_line_word << ((split_word + multi_line_word) | word)

# '^' means Or/MatchLongest, but beware using this inside a Forward
multi_line_word << (word ^ (split_word + multi_line_word))

# an unusual use of delimitedList, but it works
multi_line_word = delimitedList(word, continued_ending)

# in place of your parse action, you can wrap in a Combine
multi_line_word = Combine(delimitedList(word, continued_ending))

As you found in your pyparsing googling, BNF->pyparsing translations should be done with a special view to using pyparsing features in place of BNF, um, shortcomings. I was actually in the middle of composing a longer answer, going into more of the BNF translation issues, but you have already found this material (on the wiki, I assume).
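The variants above can be compared side by side in a small script. The definitions of word, continued_ending and split_word are assumptions, since the question's code is not reproduced here, and the '^' variant is left out.

```python
from pyparsing import (Combine, Forward, Literal, Suppress, Word, alphas,
                       delimitedList, lineEnd)

# Assumed definitions; the question's originals are not shown in this thread.
word = Word(alphas)
continued_ending = Literal('\\') + lineEnd
split_word = word + Suppress(continued_ending)

text = 'super\\\ncali\\\nfragi\\\nlistic'

# Original ordering: MatchFirst accepts the bare word and stops there.
original = Forward()
original << (word | (split_word + original))

# Reversed ordering: the recursive alternative is tried before the bare word.
reordered = Forward()
reordered << ((split_word + reordered) | word)

# delimitedList treats the continuation as a suppressed delimiter.
as_list = delimitedList(word, continued_ending)

# Combine joins the fragments without needing a parse action.
combined = Combine(delimitedList(word, continued_ending))

original_tokens = original.parseString(text).asList()
reordered_tokens = reordered.parseString(text).asList()
list_tokens = as_list.parseString(text).asList()
combined_tokens = combined.parseString(text).asList()

print(original_tokens)   # ['super']
print(reordered_tokens)  # ['super', 'cali', 'fragi', 'listic']
print(list_tokens)       # ['super', 'cali', 'fragi', 'listic']
print(combined_tokens)   # ['supercalifragilistic']
```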

Paul McGuire 2009-11-15 16:51:08
