ansaurus

Question

pyparsing ambiguity

Answer 1

+2 A:

You pretty much need more than a simple parser. Parsers use the symbols in a string to define which pieces of the string represent different elements of a grammar. This is why FM asked for some clue to indicate how you know what part is the name and what part is the rest of the sentence. If you could say that names are made up of one or more capitalized words, then the parser would know when the name stops and the rest of the sentence starts.
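To make the "capitalized words" idea concrete, here is a minimal sketch, assuming names really are runs of capitalized ASCII words. The rule, the `split_sentence` helper, and the sample sentence are illustrative assumptions, not part of the original question.

```python
import string
from pyparsing import Group, OneOrMore, Word, restOfLine

# Assumed rule: a name word is one uppercase letter followed by lowercase letters.
capitalized_word = Word(string.ascii_uppercase, string.ascii_lowercase)

# The name is every leading capitalized word; the rest of the line is the sentence.
sentence = Group(OneOrMore(capitalized_word)) + restOfLine


def split_sentence(text):
    """Return (name, rest) for a line that starts with a capitalized name."""
    result = sentence.parseString(text)
    name_tokens, rest = result[0], result[1]
    return " ".join(name_tokens), rest.strip()
```

With this rule the parser stops collecting name words at the first word that does not start with a capital, which is exactly the kind of clue an all-lowercase name like "jimmy foo decides" fails to give.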

But a name like "jimmy foo decides"? How can the parser know just by looking at the symbols in "decides" whether "decides" is or is not part of the name? Even a human reading your "jimmy foo decides decides to eat" sentence would have some trouble determining where the name starts or stops, and whether this was some sort of typo.

If your input is really this unpredictable, then you need a tool such as NLTK (the Natural Language Toolkit). I've not used it myself, but it approaches this problem from the standpoint of parsing sentences in a language, as opposed to parsing structured data or mathematical formats.

I would not recommend pyparsing for this kind of language interpretation.

Paul McGuire 2010-06-05 23:47:29

It's not that unpredictable. There are only 3-4 possible phrases, and they all have a literal ending (for example, "decides to eat", "goes to the market"). I can parse this normally by just doing str.split() on the phrase "decides to eat" and taking the name from what is left. I just want to see how to do it flexibly with pyparsing.

Claudiu 2010-06-06 05:53:53
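The plain-string approach described in this comment could look roughly like the sketch below. The `ENDINGS` list and the `split_name` helper are assumed for illustration; only the two example phrases come from the thread.

```python
# Assumed list of the literal endings; the thread says there are only 3-4.
ENDINGS = ["decides to eat", "goes to the market"]


def split_name(line):
    """Return (name, ending) if the line ends with a known phrase, else None."""
    line = line.strip()
    for ending in ENDINGS:
        if line.endswith(ending):
            # Everything to the left of the literal ending is the name.
            return line[:-len(ending)].strip(), ending
    return None
```

Because the ending is anchored at the end of the line, a name that itself ends in "decides" is handled without any guessing.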

I'll edit the question to make this clearer.

Claudiu 2010-06-06 05:55:36

Answer 2

A:

Looks like you need nltk, not pyparsing. Looks like you need a tractable problem to work on. How do YOU know how to parse 'jimmy foo decides decides to eat'? What rules do YOU use to deduce (contrary to what most people would assume) that "decides decides" is not a typo?

Re "names that can contain whitespaces": Firstly, I'd hope that you'd normalise that into one space. Secondly: this is unexpected?? Thirdly: names can contain apostrophes and hyphens (O'Brien, Montagu-Douglas-Scott) and may have components that aren't capitalised (e.g. Georg von und zu Hohenlohe), and we won't mention Unicode.

John Machin 2010-06-05 23:49:03

Dashes and other symbols are fine. If the name had no whitespace, I could just parse it with Regex(r'\S+') and be done with it. The whitespace complicates the issue because it also separates the name from the rest of the phrase.

Claudiu 2010-06-06 05:55:03

Answer 3

A:

Have fun:

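from pyparsing import Regex, oneOf

# Known names, one per line; a name may contain spaces.
THE_NAMES = \
"""Joe
bob
Jimmy X
grjiaer-rreaijgr Y
"""

# Input lines: a known name followed by an action.
THE_THINGS_THEY_DO = \
"""Joe A
bob B
Jimmy X C
Jimmy X X
grjiaer-rreaijgr Y Y
"""

# The action is whatever follows the name on the line.
ACTION = Regex('.*')
NAMES = THE_NAMES.splitlines()
print(NAMES)
# oneOf matches any one of the literal names, including those with spaces.
GRAMMAR = oneOf(NAMES) + ACTION
for line in THE_THINGS_THEY_DO.splitlines():
    print(GRAMMAR.parseString(line))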

Tal Weiss 2010-07-03 22:48:33

Yep, this is the approach I would eventually have used had I continued this way. The problem, I realized later, is that sometimes names appear in the action list that don't appear in the list of names at the top.

Claudiu 2010-07-05 15:54:17

Please add an example. From your description an action list item may be "Tal Holech Lishon", in which case you will have to somehow guess whether "Holech" is Tal's last name or "Holech Lishon" is some action you never heard of before.

Tal Weiss 2010-07-05 20:00:30

I've moved on from this for now, but the rule is: I always know what a possible action is, but I don't know what all the names are. So without pyparsing, I could do a reverse search on each line for each possible action, and if I found one, I'd know that the stuff to the left of it is a name. But how to encode that in pyparsing?

Claudiu 2010-08-11 16:32:07
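One possible way to express that rule in pyparsing is `SkipTo`: skip ahead until a known action that ends the line, and treat the skipped text as the name. This is a sketch under assumptions, not an answer given in the thread; the `ACTIONS` list and the `parse_line` helper are hypothetical, and the sketch assumes every line ends with exactly one known action.

```python
from pyparsing import SkipTo, StringEnd, oneOf

# Assumed set of known action phrases (only these two appear in the thread).
ACTIONS = ["decides to eat", "goes to the market"]

action = oneOf(ACTIONS)("action")

# The name is whatever precedes an action that runs to the end of the line.
# Requiring StringEnd in the target keeps an action-like phrase in the
# middle of a line from being mistaken for the real action.
name = SkipTo(action + StringEnd())("name")
grammar = name + action + StringEnd()


def parse_line(line):
    """Return (name, action) for a line that ends with a known action."""
    result = grammar.parseString(line)
    # SkipTo keeps the whitespace before the action, so strip it off.
    return result["name"].strip(), result["action"]
```

For example, `parse_line("jimmy foo decides decides to eat")` should yield the name "jimmy foo decides" and the action "decides to eat", since the first "decides" is not followed by "to eat".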
