ansaurus

Question

Answer 1

+3 A:

I'm not sure if you specifically require the derivation tree, or if this is just a first step in parsing. I'm assuming the latter.

You could start by defining types for the resulting abstract syntax tree. It could be something like this:

type expr =
    | Operation of term * binop * term
    | Term of term
and term =
    | Num of num
    | Lvalue of expr
    | Incrop of incrop * expr
and incrop = Incr | Decr
and binop = Plus | Minus
and num = int

Then I'd implement a recursive descent parser. Of course it would be much nicer if you could use streams combined with the preprocessor camlp4of...
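As a rough illustration of that idea, here is a minimal recursive descent sketch that builds the tree type above from a list of string tokens. The concrete syntax is an assumption on my part, since the grammar itself is not shown here: "$" is taken to introduce an lvalue, and "++" / "--" are taken to be prefix operators. The names parse_expr, parse_term and Parse_error are made up for this example.

```ocaml
type expr =
  | Operation of term * binop * term
  | Term of term
and term =
  | Num of num
  | Lvalue of expr
  | Incrop of incrop * expr
and incrop = Incr | Decr
and binop = Plus | Minus
and num = int

(* Hypothetical exception for this sketch. *)
exception Parse_error of string

(* Each function consumes a prefix of the token list and returns the
   tree it built together with the tokens it did not consume. *)
let rec parse_expr tokens =
  let left, rest = parse_term tokens in
  match rest with
  | "+" :: rest' ->
      let right, rest'' = parse_term rest' in
      (Operation (left, Plus, right), rest'')
  | "-" :: rest' ->
      let right, rest'' = parse_term rest' in
      (Operation (left, Minus, right), rest'')
  | _ -> (Term left, rest)

and parse_term = function
  (* Assumed syntax: "$" starts an lvalue, "++"/"--" are prefix operators. *)
  | "$" :: rest ->
      let e, rest' = parse_expr rest in
      (Lvalue e, rest')
  | "++" :: rest ->
      let e, rest' = parse_expr rest in
      (Incrop (Incr, e), rest')
  | "--" :: rest ->
      let e, rest' = parse_expr rest in
      (Incrop (Decr, e), rest')
  | tok :: rest ->
      (match int_of_string_opt tok with
       | Some n -> (Num n, rest)
       | None -> raise (Parse_error ("unexpected token " ^ tok)))
  | [] -> raise (Parse_error "unexpected end of input")
```

For example, parse_expr ["3"; "-"; "4"] gives (Operation (Num 3, Minus, Num 4), []). The second component is whatever input is left over, which is what you want if a prefix match is later handed to an acceptor.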

By the way, there's a small example about arithmetic expressions in the OCaml documentation here.

jdb 2009-10-18 21:54:53

Thanks and you are right - what I described is a first step in a process of creating a matcher that finds a prefix which matches the grammar, then passes it on to an acceptor...

DV 2009-10-19 05:37:22

I'm working on writing the recursive function necessary to do the parsing... So far it's quite painful.

DV 2009-10-19 07:26:49

Answer 2

+3 A:

OK, so the first thing you should do is write a lexical analyser. That's the function that takes the ‘raw’ input, like ["3"; "-"; "("; "4"; "+"; "2"; ")"], and turns it into a list of tokens (that is, representations of terminal symbols).

You can define a token to be

type token =
    | TokInt of int         (* an integer *)
    | TokBinOp of binop     (* a binary operator *)
    | TokOParen             (* an opening parenthesis *) 
    | TokCParen             (* a closing parenthesis *)     
and binop = Plus | Minus

The type of the lexer function would be string list -> token list and the output of

lexer ["3"; "-"; "("; "4"; "+"; "2"; ")"]

would be something like

[   TokInt 3; TokBinOp Minus; TokOParen; TokInt 4;
    TokBinOp Plus; TokInt 2; TokCParen   ]

This will make the job of writing the parser easier, because you won't have to worry about recognising what is an integer, what is an operator, etc.

This is a first, not too difficult step because the tokens are already separated. All the lexer has to do is identify them.
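A minimal sketch of this first lexer, assuming the only terminals are integers, "+", "-" and the two parentheses. The helper token_of_string and the exception Lex_error are names made up for this example.

```ocaml
type token =
  | TokInt of int         (* an integer *)
  | TokBinOp of binop     (* a binary operator *)
  | TokOParen             (* an opening parenthesis *)
  | TokCParen             (* a closing parenthesis *)
and binop = Plus | Minus

(* Hypothetical exception carrying the string that could not be classified. *)
exception Lex_error of string

(* Classify one already-separated piece of input. *)
let token_of_string = function
  | "+" -> TokBinOp Plus
  | "-" -> TokBinOp Minus
  | "(" -> TokOParen
  | ")" -> TokCParen
  | s ->
      (match int_of_string_opt s with
       | Some n -> TokInt n
       | None -> raise (Lex_error s))

(* The input is already split, so lexing is a plain map. *)
let lexer (input : string list) : token list =
  List.map token_of_string input
```

With these definitions, lexer ["3"; "-"; "("; "4"; "+"; "2"; ")"] evaluates to the token list shown above.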

When this is done, you can write a more realistic lexical analyser, of type string -> token list, that takes an actual raw input, such as "3-(4+2)", and turns it into a token list.
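That second lexer could look roughly like this. It is only a sketch under a few assumptions: whitespace is skipped, integers are runs of decimal digits, and anything else is an error. The names lex_string, is_digit and Lex_error are invented for the example.

```ocaml
type token =
  | TokInt of int
  | TokBinOp of binop
  | TokOParen
  | TokCParen
and binop = Plus | Minus

(* Hypothetical exception carrying the offending character. *)
exception Lex_error of char

let is_digit c = '0' <= c && c <= '9'

let lex_string (s : string) : token list =
  let len = String.length s in
  (* i is the current position, acc holds the tokens seen so far, reversed. *)
  let rec go i acc =
    if i >= len then List.rev acc
    else
      match s.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc
      | '+' -> go (i + 1) (TokBinOp Plus :: acc)
      | '-' -> go (i + 1) (TokBinOp Minus :: acc)
      | '(' -> go (i + 1) (TokOParen :: acc)
      | ')' -> go (i + 1) (TokCParen :: acc)
      | c when is_digit c ->
          (* Scan to the end of the run of digits. *)
          let j = ref i in
          while !j < len && is_digit s.[!j] do incr j done;
          let n = int_of_string (String.sub s i (!j - i)) in
          go !j (TokInt n :: acc)
      | c -> raise (Lex_error c)
  in
  go 0 []
```

Here lex_string "3-(4+2)" produces the same token list as the list-based lexer does for ["3"; "-"; "("; "4"; "+"; "2"; ")"], so the parser does not need to change when you switch inputs.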

jdb 2009-10-19 15:32:51

Thanks, I'll give this a try and update soon!

DV 2009-10-20 02:31:21

No need for a lexer, as the fragments to parse are already represented as lists. The grammar is left-factored, so just descend recursively over the input list.

ygrek 2009-10-20 09:08:28

@ygrek: But it's gonna be easier to write the parser with pattern-matching. It's much more painful to make the matcher understand the difference between `"342"` and `"++"` (they're both strings) than the one between `TokInt` and `TokBinOp`. Plus the OP may want to parse a string instead of a list some day.

jdb 2009-10-20 16:22:42

Look at the grammar - "342" is not allowed, so the terminals are just compared as is. Mind, when descending from top to bottom the parser doesn't need to distinguish tokens "342" and "++" -- it will just try to match the current input against all the terminals in the current branch, in order. As for me, a separate lexer is an unnecessary complication here.

ygrek 2009-10-21 07:25:28

Answer 3

+3 A:

Here is a rough sketch - straightforwardly descend into the grammar and try each branch in order. Possible optimization: tail recursion for a single non-terminal in a branch.

exception Backtrack

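(* awksub_grammar is used as a pair (start symbol, rules), where rules maps
   a nonterminal to its list of branches and each branch is a list of
   T terminal / N nonterminal symbols. *)
let parse l =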
  let rules = snd awksub_grammar in
  let rec descend gram l =
    let rec loop = function 
      | [] -> raise Backtrack
      | x::xs -> try attempt x l with Backtrack -> loop xs
    in
    loop (rules gram)
  and attempt branch (path,tokens) =
    match branch, tokens with
    | T x :: branch' , h::tokens' when h = x -> 
        attempt branch' ((T x :: path),tokens')
    | N n :: branch' , _ -> 
        let (path',tokens) = descend n ((N n :: path),tokens) in 
        attempt branch' (path', tokens)
    | [], _ -> path,tokens
    | _, _ -> raise Backtrack
  in
  let (path,tail) = descend (fst awksub_grammar) ([],l) in
  tail, List.rev path
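To try the sketch out, something like the following should work. The symbol type and the small grammar are stand-ins invented for this example (the real awksub_grammar is not shown here, so its shape is an assumption); the parse function is the one above, repeated so the snippet is self-contained.

```ocaml
(* Assumed shape of grammar symbols: N wraps a nonterminal, T a terminal. *)
type ('n, 't) symbol = N of 'n | T of 't

(* Hypothetical nonterminals for a reduced grammar. *)
type nonterm = Expr | Term | Binop | Num

(* Hypothetical stand-in for awksub_grammar: (start symbol, rules). *)
let awksub_grammar =
  ( Expr,
    function
    | Expr -> [ [ N Term; N Binop; N Expr ]; [ N Term ] ]
    | Term -> [ [ N Num ]; [ T "("; N Expr; T ")" ] ]
    | Binop -> [ [ T "+" ]; [ T "-" ] ]
    | Num -> [ [ T "0" ]; [ T "1" ]; [ T "2" ]; [ T "3" ]; [ T "4" ] ] )

exception Backtrack

let parse l =
  let rules = snd awksub_grammar in
  (* Try each branch of nonterminal gram in order. *)
  let rec descend gram l =
    let rec loop = function
      | [] -> raise Backtrack
      | x :: xs -> (try attempt x l with Backtrack -> loop xs)
    in
    loop (rules gram)
  (* Match the symbols of one branch against the remaining tokens. *)
  and attempt branch (path, tokens) =
    match branch, tokens with
    | T x :: branch', h :: tokens' when h = x ->
        attempt branch' (T x :: path, tokens')
    | N n :: branch', _ ->
        let path', tokens = descend n (N n :: path, tokens) in
        attempt branch' (path', tokens)
    | [], _ -> (path, tokens)
    | _, _ -> raise Backtrack
  in
  let path, tail = descend (fst awksub_grammar) ([], l) in
  (tail, List.rev path)

(* Unconsumed tokens and the recorded path for a one-token input. *)
let tail, path = parse [ "1" ]
```

With this toy grammar, parse ["1"] returns ([], [N Term; N Num; T "1"]): no leftover tokens, and the symbols visited in order. Input that cannot be matched at all, such as ["+"], raises Backtrack. Note that only a prefix has to match, so parse ["1"; ")"] succeeds and returns [")"] as the tail.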

ygrek 2009-10-21 07:56:01

Parsing grammars using OCaml