I'm trying to parse some text using pyparsing. The problem is that my names can contain whitespace. My input might look like this. First, a list of names:

Joe
bob
Jimmy X
grjiaer-rreaijgr Y

Then, things they do:

Joe A
bob B
Jimmy X C

The problem, of course, is that a thing they do can be the same as the end of a name:

Jimmy X X
grjiaer-rreaijgr Y Y

How can I create a parser for the action lines? The output of parsing Joe A should be [Joe, A]. The output of parsing Jimmy X C should be [Jimmy X, C], of Jimmy X X - [Jimmy X, X]. That is, [name, action] pairs.

If I create my name parser naively, with something like OneOrMore(Regex(r"\S+")), then it will match the entire line, giving me [Jimmy X X] followed by a parsing error for not seeing an action (since it was already consumed by the name parser).

NOTE: Sorry for the ambiguous phrasing earlier that made this look like an NLP question.

+2  A: 

You pretty much need more than a simple parser. Parsers use the symbols in a string to define which pieces of the string represent different elements of a grammar. This is why FM asked for some clue to indicate how you know what part is the name and what part is the rest of the sentence. If you could say that names are made up of one or more capitalized words, then the parser would know when the name stops and the rest of the sentence starts.

But what about a name like "jimmy foo decides"? How can the parser know, just by looking at the symbols in "decides", whether "decides" is or is not part of the name? Even a human reading your "jimmy foo decides decides to eat" sentence would have some trouble determining where the name starts or stops, and whether this was some sort of typo.

If your input is really this unpredictable, then you need to use a tool such as the NLTK (Natural Language Toolkit). I've not used it myself, but it approaches this problem from the standpoint of parsing sentences in a language, as opposed to trying to parse structured data or mathematical formats.

I would not recommend pyparsing for this kind of language interpretation.

Paul McGuire
It's not that unpredictable. There are only 3-4 possible phrases, and they all have a literal ending (for example, "decides to eat", "goes to the market"). I can parse this normally by just doing str.split() on the phrase "decides to eat" and looking at the name. I just want to see how to do it flexibly with pyparsing.
Claudiu
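A minimal sketch of the plain-string approach described in the comment above, assuming the full list of phrases is known in advance. Only the two phrases quoted in the comment are used here; the function name split_name_and_phrase is hypothetical.

```python
# Assumed phrase list: only the two examples quoted in the comment.
KNOWN_PHRASES = ["decides to eat", "goes to the market"]


def split_name_and_phrase(line, phrases=KNOWN_PHRASES):
    """Return (name, phrase) if the line ends with a known phrase, else None."""
    text = line.strip()
    # Check longer phrases first so a phrase that is a suffix of
    # another phrase cannot win by accident.
    for phrase in sorted(phrases, key=len, reverse=True):
        if text.endswith(" " + phrase):
            # Everything to the left of the trailing phrase is the name.
            name = text[: -len(phrase)].strip()
            return name, phrase
    return None


print(split_name_and_phrase("jimmy foo decides decides to eat"))
```

Because the phrase is matched only at the end of the line, a name that happens to end in "decides" is still split correctly.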
i'll edit the question to make this clearer
Claudiu
A: 

Looks like you need nltk, not pyparsing. Looks like you need a tractable problem to work on. How do YOU know how to parse 'jimmy foo decides decides to eat'? What rules do YOU use to deduce (contrary to what most people would assume) that "decides decides" is not a typo?

Re "names that can contain whitespaces": Firstly, I'd hope that you'd normalise that into one space. Secondly: this is unexpected?? Thirdly: names can contain apostrophes and hyphens (O'Brien, Montagu-Douglas-Scott) and may have components that aren't capitalised (e.g. Georg von und zu Hohenlohe), and we won't mention Unicode.

John Machin
Dashes and other symbols are fine. If the name had no whitespace, I could just parse it with Regex(r"\S+") and be done with it. The whitespace complicates the issue because it also separates the name from the rest of the phrase.
Claudiu
A: 

Have fun:

from pyparsing import Regex, oneOf

THE_NAMES = \
"""Joe
bob
Jimmy X
grjiaer-rreaijgr Y
"""

THE_THINGS_THEY_DO = \
"""Joe A
bob B
Jimmy X C
Jimmy X X
grjiaer-rreaijgr Y Y
"""

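# The action is whatever remains on the line after a known name.
ACTION = Regex('.*')
NAMES = THE_NAMES.splitlines()
print(NAMES)
# oneOf tests the longer alternative first when one name is a prefix of another.
GRAMMAR = oneOf(NAMES) + ACTION
for line in THE_THINGS_THEY_DO.splitlines():
    print(GRAMMAR.parseString(line))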
Tal Weiss
Yep, this is the approach I would eventually have used had I continued this way. The problem, I realized later, is that sometimes names appear in the action list that don't appear at the top.
Claudiu
Please add an example. From your description, an action list item may be "Tal Holech Lishon", in which case you will have to somehow guess whether "Holech" is Tal's last name or "Holech Lishon" is some action you never heard of before.
Tal Weiss
I've moved on from this for now, but the rule is: I always know what the possible actions are, but I don't know all the names. So, outside pyparsing, I could do a reverse search on each line for each possible action, and if I found one, I'd know that everything to its left is a name. But how do I encode that in pyparsing?
Claudiu
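One possible way to encode that rule in pyparsing, offered as a sketch rather than a definitive answer: require the action to be the last token on the line, and let SkipTo collect everything before it as the name. The ACTIONS list is an assumption taken from the sample lines in the question, and parse_action_line is a hypothetical helper name.

```python
from pyparsing import SkipTo, StringEnd, WordStart, oneOf

# Assumed action vocabulary, taken from the sample lines in the question.
ACTIONS = ["A", "B", "C", "X", "Y"]

# An action only counts when it starts a whitespace-separated word
# and is the last thing on the line.
action_at_end = WordStart() + oneOf(ACTIONS)("action") + StringEnd()

# SkipTo collects the text up to the first position where action_at_end
# matches; the parse action trims the space left before the action.
name = SkipTo(action_at_end)("name").setParseAction(lambda toks: toks[0].strip())

line_grammar = name + action_at_end


def parse_action_line(line):
    """Return [name, action] for a line that ends with a known action."""
    result = line_grammar.parseString(line.strip())
    return [result["name"], result["action"]]


for sample in ["Joe A", "Jimmy X C", "Jimmy X X", "grjiaer-rreaijgr Y Y"]:
    print(parse_action_line(sample))
```

Since the first "X" in "Jimmy X X" is not followed by the end of the line, it stays part of the name. Note that this sketch accepts an empty name (a line consisting of only an action), so a real grammar may want to reject that case.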