views:

609

answers:

3

Hi,

I have a task to write a (toy) parser for a (toy) grammar using OCaml and not sure how to start (and proceed with) this problem.

Here's a sample Awk grammar:

type ('nonterm, 'term) symbol = N of 'nonterm | T of 'term;;

type awksub_nonterminals = Expr | Term | Lvalue | Incrop | Binop | Num;;

let awksub_grammar =
  (Expr,
   function
     | Expr ->
         [[N Term; N Binop; N Expr];
          [N Term]]
     | Term ->
     [[N Num];
      [N Lvalue];
      [N Incrop; N Lvalue];
      [N Lvalue; N Incrop];
      [T"("; N Expr; T")"]]
     | Lvalue ->
     [[T"$"; N Expr]]
     | Incrop ->
     [[T"++"];
      [T"--"]]
     | Binop ->
     [[T"+"];
      [T"-"]]
     | Num ->
     [[T"0"]; [T"1"]; [T"2"]; [T"3"]; [T"4"];
      [T"5"]; [T"6"]; [T"7"]; [T"8"]; [T"9"]]);;

And here's some fragments to parse:

let frag1 = ["4"; "+"; "3"];;
let frag2 = ["9"; "+"; "$"; "1"; "+"];;

What I'm looking for is a rulelist that is the result of the parsing a fragment, such as this one for frag1 ["4"; "+"; "3"]:

 [(Expr, [N Term; N Binop; N Expr]);
  (Term, [N Num]);
  (Num, [T "3"]);
  (Binop, [T "+"]);
  (Expr, [N Term]);
  (Term, [N Num]);
  (Num, [T "4"])]

The restriction is to not use any OCaml libraries other than List... :/

+3  A: 

I'm not sure if you specifically require the derivation tree, or if this is a just a first step in parsing. I'm assuming the latter.

You could start by defining the structure of the resulting abstract syntax tree by defining types. It could be something like this:

type expr =
    | Operation of term * binop * term
    | Term of term
and term =
    | Num of num
    | Lvalue of expr
    | Incrop of incrop * expression
and incrop = Incr | Decr
and binop = Plus | Minus
and num = int

Then I'd implement a recursive descent parser. Of course it would be much nicer if you could use streams combined with the preprocessor camlp4of...

By the way, there's a small example about arithmetic expressions in the OCaml documentation here.

jdb
Thanks and you are right - what I described is a first step in a process of creating a matcher that finds a prefix which matches the grammar, then passes it on to an acceptor...
DV
I'm working on writing the recursive function necessary to do the parsing... So far it's quite painful.
DV
+3  A: 

Ok, so the first think you should do is write a lexical analyser. That's the function that takes the ‘raw’ input, like ["3"; "-"; "("; "4"; "+"; "2"; ")"], and splits it into a list of tokens (that is, representations of terminal symbols).

You can define a token to be

type token =
    | TokInt of int         (* an integer *)
    | TokBinOp of binop     (* a binary operator *)
    | TokOParen             (* an opening parenthesis *) 
    | TokCParen             (* a closing parenthesis *)     
and binop = Plus | Minus

The type of the lexer function would be string list -> token list and the ouput of

lexer ["3"; "-"; "("; "4"; "+"; "2"; ")"]

would be something like

[   TokInt 3; TokBinOp Minus; TokOParen; TokInt 4;
    TBinOp Plus; TokInt 2; TokCParen   ]

This will make the job of writing the parser easier, because you won't have to worry about recognising what is a integer, what is an operator, etc.

This is a first, not too difficult step because the tokens are already separated. All the lexer has to do is identify them.

When this is done, you can write a more realistic lexical analyser, of type string -> token list, that takes a actual raw input, such as "3-(4+2)" and turns it into a token list.

jdb
Thanks, I'll give this a try and update soon!
DV
No need for lexer as the fragments to parse are represented as lists already. The grammar is left-factored so just descend recursively using the input list - straightforwardly.
ygrek
@ygrek: But it's gonna be easier to write the parser with pattern-matching. It's much more painful to make the matcher understand the difference between `"342"` and `"++"` (they're both strings) than the one between `TokInt` and `TokBinOp`. Plus the OP may want to parse a string instead of a list some day.
jdb
Look at the grammar - "342" is not allowed, so the terminals are just compared as is. Mind, when descending from top to bottom the parser doesn't need to distinguish tokens "342" and "++" -- it will just try to match the current input with all then terminals in the current branch in order. As for me, separate lexer is an unnecessary complication here.
ygrek
+3  A: 

Here is a rough sketch - straightforwardly descend into the grammar and try each branch in order. Possible optimization : tail recursion for single non-terminal in a branch.

exception Backtrack

let parse l =
  let rules = snd awksub_grammar in
  let rec descend gram l =
    let rec loop = function 
      | [] -> raise Backtrack
      | x::xs -> try attempt x l with Backtrack -> loop xs
    in
    loop (rules gram)
  and attempt branch (path,tokens) =
    match branch, tokens with
    | T x :: branch' , h::tokens' when h = x -> 
        attempt branch' ((T x :: path),tokens')
    | N n :: branch' , _ -> 
        let (path',tokens) = descend n ((N n :: path),tokens) in 
        attempt branch' (path', tokens)
    | [], _ -> path,tokens
    | _, _ -> raise Backtrack
  in
  let (path,tail) = descend (fst awksub_grammar) ([],l) in
  tail, List.rev path
ygrek