I'm trying to parse a string in a self-made language into a sort of tree, e.g.:

# a * b1 b2 -> c * d1 d2 -> e # f1 f2 * g

should result in:

# a
  * b1 b2
    -> c
  * d1 d2
    -> e
# f1 f2
  * g

#, * and -> are symbols. a, b1, etc. are texts.
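
For reference, the mapping from the flat string to the indented tree above can be sketched directly, without going through RPN. This is only a minimal sketch under one strong assumption that the post does not state: each symbol has a fixed nesting depth (# at depth 0, * at depth 1, -> at depth 2). The names LEVELS, parse_tree and render are hypothetical, and the sketch does not cover cases where a symbol may appear at more than one depth.

```python
# Assumption (not from the post): every symbol has one fixed nesting depth.
LEVELS = {"#": 0, "*": 1, "->": 2}


def parse_tree(source):
    """Parse a whitespace-separated string into a list of top-level nodes."""
    roots = []
    path = []       # path[d] is the most recently opened node at depth d
    current = None  # node that receives the following text tokens
    for token in source.split():
        if token in LEVELS:
            depth = LEVELS[token]
            if depth > len(path):
                raise ValueError(f"{token!r} has no parent at depth {depth - 1}")
            current = {"symbol": token, "text": [], "children": []}
            # Attach to the root list or to the latest node one level up.
            siblings = roots if depth == 0 else path[depth - 1]["children"]
            siblings.append(current)
            # Close every node at this depth or deeper, then open the new one.
            del path[depth:]
            path.append(current)
        elif current is None:
            raise ValueError(f"text {token!r} appears before any symbol")
        else:
            current["text"].append(token)
    return roots


def render(nodes, indent=0):
    """Return the tree as indented lines, two spaces per level."""
    lines = []
    for node in nodes:
        lines.append("  " * indent + " ".join([node["symbol"], *node["text"]]))
        lines.extend(render(node["children"], indent + 1))
    return lines


if __name__ == "__main__":
    example = "# a * b1 b2 -> c * d1 d2 -> e # f1 f2 * g"
    print("\n".join(render(parse_tree(example))))
```

Because the depth of each symbol is fixed, any number of text tokens can follow a symbol without markers; the trade-off is that the same symbol cannot be reused at a different depth.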

So far, RPN is the only method I know for evaluating expressions, so my current solution is as follows. If I allow only a single text token after each symbol, I can easily convert the expression into RPN first (with b = b1 b2; d = d1 d2; f = f1 f2) and parse it from there:

a b c -> * d e -> * # f g * #

However, merging the text tokens with whatever else follows seems to be problematic. My idea was to introduce marker tokens (M), so that the RPN looks like:

a M b2 b1 M c -> * M d2 d1 M e -> * # f2 f1 M g * #

which is also parseable and seems to solve the problem.

That said:

  1. Does anyone have experience with something like this and can say whether it is a viable solution in the long run?
  2. Are there better methods for parsing expressions with undefined arity of operators?
  3. Can you point me at some good resources?

Note: Yes, I know this example closely resembles Lisp prefix notation, and maybe the way to go would be to add some brackets, but I don't have any experience there. However, the source text must not contain any artificial brackets, and I'm also not sure what to do about potential infix mixins like # a * b -> [if value1 = value2] c -> d.

Thanks for any help.

EDIT: It seems that what I'm looking for are sources on postfix notation with a variable number of arguments.

+2  A: 

I couldn't fully understand your question, but it seems what you want is a grammar definition and a parser generator. I suggest you take a look at ANTLR; defining a grammar for either your original syntax or the RPN should be pretty straightforward with it.

Edit: (After exercising self-criticism, and making some effort to understand the question details.) Actually, the language grammar is unclear from your example. However, it seems to me that the advantages of prefix/postfix notation (i.e. that you need neither parentheses nor a precedence-aware parser) stem from the fact that you know the number of arguments every time you encounter an operator, so you know exactly how many elements to read (for prefix notation) or to pop from the stack (for postfix notation). OTOH, I believe that operators with a variable number of arguments make prefix/postfix notation not simply difficult to parse but outright ambiguous. Take the following expression for example:

# a * b c d

Which of the following three is the canonical form?

  1. #(a, *(b, c, d))
  2. #(a, *(b, c), d)
  3. #(a, *(b), c, d)

Without knowing more about the operators, it is impossible to tell. Of course you could define some sort of greediness for the operators, e.g. * is greedier than #, so it gobbles up all the arguments. But this would defeat the purpose of a prefix notation, because you simply wouldn't be able to write down the second variant of the three above, not without additional syntactic elements.

Now that I think of it, it is probably not by sheer chance that none of the programming languages I know support operators with a variable number of arguments, only functions/procedures.
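
To make the fixed-arity point concrete, here is a minimal sketch of a postfix parser that builds a tree by popping exactly as many operands as each operator declares. The ARITY table and the name parse_postfix are assumptions made up for illustration; the arities are not taken from the question.

```python
# Hypothetical arity table: each operator takes a fixed number of operands.
ARITY = {"#": 2, "*": 2, "->": 1}


def parse_postfix(tokens):
    """Build a nested tuple (operator, operand, ...) from postfix tokens."""
    stack = []
    for token in tokens:
        if token in ARITY:
            n = ARITY[token]
            if len(stack) < n:
                raise ValueError(f"{token!r} needs {n} operand(s)")
            # The arity tells us exactly how many stack items belong to it.
            split = len(stack) - n
            args = stack[split:]
            del stack[split:]
            stack.append((token, *args))
        else:
            stack.append(token)  # plain text token
    if len(stack) != 1:
        raise ValueError("expression does not reduce to a single tree")
    return stack[0]


if __name__ == "__main__":
    print(parse_postfix("a b c -> * #".split()))
```

With a fixed arity there is never a question of where an operator's arguments stop. Once the arity is allowed to vary, the pop count is no longer known from the operator alone, which is exactly where the three readings of # a * b c d come from.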

David Hanak
Dear David, thanks for your time and for the ANTLR link. What I'm actually building is not a programming language, and I probably misled you by using the term "operator". The real purpose of the language is a human-friendly serialization of a tree. The canonical form is 1, but I may introduce "end" lexemes like /*
ctd.: So # a * b c /* d will result in #(a, *(b, c), d). I am also happy to report that the approach using artificial marker lexemes seems to be working so far.
# a * b c /* d would then be RPN'ed into: M a M c b * d #
And also the whole thing is starting to resemble TeX.
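
A marker-based postfix scheme of the kind described in these comments can be sketched as follows. This is only an illustration under stated assumptions: a marker M is emitted where an operator's operands begin, the operator pops everything back to the nearest marker, and operands are listed left to right (unlike the RPN above, which lists the text tokens in reverse, as in "c b"). The names MARKER, OPERATORS and parse_marked_postfix are hypothetical.

```python
MARKER = "M"                  # assumed marker lexeme, as in the comments above
OPERATORS = {"#", "*", "->"}


def parse_marked_postfix(tokens):
    """Build nested tuples; each operator takes everything back to its marker."""
    stack = []
    for token in tokens:
        if token in OPERATORS:
            args = []
            # Pop operands until the marker that opened this operator.
            while stack and stack[-1] != MARKER:
                args.append(stack.pop())
            if not stack:
                raise ValueError(f"no marker found for operator {token!r}")
            stack.pop()     # discard the marker itself
            args.reverse()  # restore left-to-right operand order
            stack.append((token, *args))
        else:
            stack.append(token)  # markers and text tokens are both pushed
    if MARKER in stack:
        raise ValueError("marker without a matching operator")
    return stack


if __name__ == "__main__":
    print(parse_marked_postfix("M a M b c * d #".split()))
```

Under these assumptions the input M a M b c * d # reduces to the single tree #(a, *(b, c), d), i.e. the second variant from the answer, because the marker rather than a fixed arity delimits each operator's operands. One limitation of this sketch is that a text token spelled M cannot be distinguished from a marker.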