ansaurus

Question

Keyword Matching in Pyparsing: non-greedy slurping of tokens

Answer 1

A:

This answers a question that you have probably also asked yourself: "What's a real-world application for reduce?"

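>>> from functools import reduce  # needed on Python 3, harmless on 2.6+
>>> from pyparsing import Keyword
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']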
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
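As a minimal sketch of the same idea outside the interactive prompt: the alternation can also be built in one step by handing the list to MatchFirst, which is the class the | operator produces. The variable names below are only illustrative, and the exact repr of the resulting parser may differ between pyparsing versions.

```python
from functools import reduce  # reduce is not a builtin on Python 3

from pyparsing import Keyword, MatchFirst, ParseException

keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']

# Fold the keywords together with the | operator, as in the answer.
p = reduce(lambda x, y: x | y, [Keyword(k) for k in keys])

# Equivalent alternation built directly from the list.
q = MatchFirst([Keyword(k) for k in keys])

first_from_p = p.parseString('HORSE')[0]
first_from_q = q.parseString('HORSE')[0]

# Keyword only matches whole words, so 'HORSEFLY' is rejected.
try:
    p.parseString('HORSEFLY')
    horsefly_matched = True
except ParseException:
    horsefly_matched = False
```

Both p and q try the keywords in list order and return the first one that matches.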

Edit:

This was a pretty good answer to the original question. I'll have to work on the new one.

Further edit:

I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't do lookahead. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.

Robert Rossney 2009-12-15 06:31:42

Albeit slow (I have a huge file), Pyparsing was able to accomplish the desired task.

Arrieta 2009-12-15 21:29:15

Pyparsing has 2 flavors of lookahead: the NotAny class (abbreviated using the `~` operator) for negative lookahead, and the FollowedBy class (no operator shortcut - we have to draw the line somewhere, or we might as well be writing Perl) for positive lookahead. But you are correct to the extent that pyparsing does *not* implicitly look ahead in its grammar for "the next match"; you have to code it in yourself using one of these constructs.

Paul McGuire 2009-12-16 00:06:47
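A small sketch of the two lookahead constructs mentioned in this comment. The grammar here (an END keyword and a name followed by an opening parenthesis) is made up purely for illustration, and `matches` is a hypothetical helper, not part of pyparsing.

```python
from pyparsing import FollowedBy, Keyword, Literal, ParseException, Word, alphas

# Negative lookahead: accept any word except the keyword END.
not_end = ~Keyword('END') + Word(alphas)

# Positive lookahead: accept a word only when '(' comes next.
# The '(' is checked but not consumed, and contributes no token.
call_name = Word(alphas) + FollowedBy(Literal('('))


def matches(expr, text):
    """Return the parsed tokens as a list, or None if parsing fails."""
    try:
        return expr.parseString(text).asList()
    except ParseException:
        return None


start_tokens = matches(not_end, 'START')    # parsed as a normal word
end_tokens = matches(not_end, 'END')        # rejected by ~Keyword('END')
call_tokens = matches(call_name, 'foo(1)')  # 'foo' is followed by '('
plain_tokens = matches(call_name, 'foo bar')  # no '(' after 'foo'
```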

Also, this chaining of several literals can be done in one of several ways: `Or(map(Keyword, "RED GRN BLUE".split()))` or `oneOf("RED GRN BLUE")`. oneOf is actually preferred here, as it has the smarts built in to be able to tell the difference between a lone ">" and the leading character of ">=".

Paul McGuire 2009-12-16 00:09:47
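To illustrate the point about ">" versus ">=", here is a short sketch comparing a hand-built alternation with oneOf. The variable names are illustrative. Note that oneOf, as used here, matches the strings as literals rather than as Keywords, so it does not enforce whole-word matching by default.

```python
from pyparsing import Literal, oneOf

# Hand-built alternation: '>' is tried first and wins, leaving '=' unparsed.
naive = Literal('>') | Literal('>=')

# oneOf reorders the alternatives so the longer '>=' is not masked by '>'.
smart = oneOf('> >=')

naive_result = naive.parseString('>=').asList()
smart_result = smart.parseString('>=').asList()

# Literal-style matching: the leading 'RED' of 'REDDISH' is accepted.
colour_result = oneOf('RED GRN BLUE').parseString('REDDISH').asList()
```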

Answer 2

+6 A:

I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:

from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value

When we run this from the interactive interpreter, we can see the following:

>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})

Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.

EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:

Identifier = Combine(Word(alphanums) +
        ZeroOrMore('_' + ~Parameter + Word(alphanums)))

Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately into the logic.

Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but once that pair is repeated, it is exactly what gets us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial Word, you can use OneOrMore.)

Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
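Putting the pieces together, the following is a self-contained sketch of the grammar described above, using explicit imports instead of `from pyparsing import *`. The `parse_entry` helper and the extra test strings are illustrative additions, not part of the original answer.

```python
from pyparsing import (Combine, Literal, ParseException, Suppress, Word,
                       ZeroOrMore, alphanums, nums)

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter

# Word, then any number of '_' + Word pairs that do not start a Parameter.
# Combine requires the pieces to be contiguous and joins them into one token.
Identifier = Combine(Word(alphanums) +
                     ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Value = Word(nums)
Entry = Identifier + UndParam + Value


def parse_entry(text):
    """Return [identifier, parameter, value], or None on a parse error."""
    try:
        return Entry.parseString(text).asList()
    except ParseException:
        return None


two_parts = parse_entry('ABC_123_SPEED_X 123')
three_parts = parse_entry('ABC_123_Hola_SPEED_X 123')
one_part = parse_entry('ABC_SPEED_Y 7')
with_space = parse_entry('ABC _123_SPEED_X 123')  # whitespace inside identifier
```

With this grammar, identifiers with one, two, or three underscore-separated components all split off the trailing parameter, while the string with a space inside the identifier is rejected.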

Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.

gotgenes 2009-12-15 07:46:51

This may be working ... I am trying to figure out why it works for Parameter declared as a collection of Literals, and will not work if Parameter is declared as a collection of Keywords. Thank you! This may just be the answer!

Arrieta 2009-12-15 18:05:42

It has something to do with the fact that the underscore is in the default `identChars` for `Keyword`. http://crpppc19.epfl.ch/doc/python-pyparsing/htmldoc/pyparsing.pyparsing.Keyword-class.html If you use `Keyword('SPEED_X', identChars=alphanums)` you will get matching. But I would stick with `Literal`.

gotgenes 2009-12-15 18:54:05
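The effect of `identChars` can be seen in isolation with a sketch like the one below. It assumes the behavior described in the comment above: a Keyword refuses to match when the character just before it is one of its identifier characters, and the default set includes the underscore. The `try_parse` helper is hypothetical.

```python
from pyparsing import Keyword, ParseException, Suppress, alphanums

# Default identChars include '_', so the preceding underscore blocks the match.
default_kw = Suppress('_') + Keyword('SPEED_X')

# With identChars restricted to alphanums, '_' counts as a word boundary.
relaxed_kw = Suppress('_') + Keyword('SPEED_X', identChars=alphanums)


def try_parse(expr, text):
    """Return the parsed tokens as a list, or None if parsing fails."""
    try:
        return expr.parseString(text).asList()
    except ParseException:
        return None


default_result = try_parse(default_kw, '_SPEED_X')
relaxed_result = try_parse(relaxed_kw, '_SPEED_X')
```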

Redefine identifier as `Combine(Word(alphanums)+ZeroOrMore('_'+~Parameter+Word(alphanums)))` and I think this will hit the mark. (BTW, I'm thrilled to see more pyparsing users pitching in on these SO questions!)

Paul McGuire 2009-12-15 20:47:19

Thanks for that solution, Paul. I couldn't figure out the right way to put the parts together this morning. Your solution is very elegant so I added it to the answer and explained it.

gotgenes 2009-12-15 23:20:37

Answer 3

+1 A:

If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:

from pyparsing import *

my_string = 'ABC_123_SPEED_X 123'

Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter  + Value
tokens = Entry.parseString(my_string)

print(tokens)  # prints: ['ABC_123', 'SPEED_X', '123']

If that is not the case, but the identifier length is fixed, you can define Identifier like this:

Identifier = Word( alphanums + '_' , exact=7)

bertrandchenal 2009-12-15 09:55:02

Unfortunately, this alternative will not work if the Identifier has more than two components. For instance, it breaks for my_string='ABC_123_Hola_SPEED_X 123'.

Arrieta 2009-12-15 18:13:31

Very close, Identifier just needs a little more tolerance for multiple components, plus the negative lookahead `~Parameter` to avoid accidentally reading the parameter as part of the Identifier.

Paul McGuire 2009-12-15 23:59:42

Answer 4

+1 A:

You can also parse the identifier and parameter as one token, and split them in a parse action:

from pyparsing import *
import re

def split_ident_and_param(tokens):
    mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
    return [mo.group(1), mo.group(2)]

ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value

print(entry.parseString("APC_123_SPEED_X 123"))

The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).

If this is not the case, you need to adjust the split_ident_and_param() function.

codeape 2009-12-15 10:31:12
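One possible adjustment, sketched under the assumption that the set of parameter names is known in advance (here the three SPEED_* names used elsewhere on this page): anchor the regular expression on the parameter suffix, so the identifier may have any number of underscore-separated components. The `PARAMETERS` tuple and the ValueError are illustrative choices, not part of the original answer.

```python
import re

# Assumed, fixed list of parameter names that can end the combined token.
PARAMETERS = ('SPEED_X', 'SPEED_Y', 'SPEED_Z')

# Everything before the final "_<parameter>" is treated as the identifier.
_SPLIT_RE = re.compile(
    r'^(.+)_(' + '|'.join(re.escape(p) for p in PARAMETERS) + r')$')


def split_ident_and_param(tokens):
    """Split 'IDENT_PARAM' into [IDENT, PARAM] using the known parameters."""
    mo = _SPLIT_RE.match(tokens[0])
    if mo is None:
        raise ValueError('no known parameter suffix in %r' % tokens[0])
    return [mo.group(1), mo.group(2)]


two_parts = split_ident_and_param(['APC_123_SPEED_X'])
three_parts = split_ident_and_param(['ABC_123_Hola_SPEED_X'])
one_part = split_ident_and_param(['ABC_SPEED_Y'])
```

The function keeps the same signature as the one above, so it is meant to be attached with setParseAction in the same way.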

+1 for using a parse action. Unfortunately, the OP posted only one flavor of Identifier, and we learn in a later post that there might be *more* than 2 underscore-separated components. I suspect it is just as likely that some identifiers will have only a single component, so the parser really has to take care with those underscores. Such is often the way with parsing questions: the example data is a very small and special subset of the larger set of possible and likely inputs.

Paul McGuire 2009-12-16 00:03:40
