tags:

views:

468

answers:

2

As a purely academic exercise, I'm writing a recursive descent parser from scratch -- without using ANTLR or lex/yacc.

I'm writing a simple function which converts math expressions into their equivalent AST. I have the following:

// grammar
type expr =
    | Lit of float
    | Add of expr * expr
    | Mul of expr * expr
    | Div of expr * expr
    | Sub of expr * expr

// tokens
type tokens =
    | Num of float
    | LParen | RParen
    | XPlus | XStar | XMinus | XSlash

let tokenize (input : string) =
    Regex.Matches(input.Replace(" ", ""), "\d+|[+/*\-()]")
    |> Seq.cast<Match>
    |> Seq.map (fun x -> x.Value)
    |> Seq.map (function
        | "+" -> XPlus
        | "-" -> XMinus
        | "/" -> XSlash
        | "*" -> XStar
        | "(" -> LParen
        | ")" -> RParen
        | num -> Num(float num))
    |> Seq.to_list

So, tokenize "10 * (4 + 5) - 1" returns the following token stream:

[Num 10.0; XStar; LParen; Num 4.0; XPlus; Num 5.0; RParen; XMinus; Num 1.0]

At this point, I'd like to map the token stream to its AST with respect to operator precedence:

Sub(
    Mul(
        Lit 10.0
        ,Add(Lit 4.0, Lit 5.0)
       )
    ,Lit 1.0
   )

However, I'm drawing a blank. I've never written a parser from scratch, and I don't know even in principle how to begin.

How do I convert a token stream its representative AST?

+6  A: 

Do you know about language grammars?

Assuming yes, you have a grammar with rules along the lines

...
addTerm := mulTerm addOp addTerm
         | mulTerm

addOp   := XPlus | XMinus

mulTerm := litOrParen mulOp mulTerm
         | litOrParen
...

which ends up turning into code like (writing code in browser, never compiled)

let rec AddTerm() =
    let mulTerm = MulTerm() // will parse next mul term (error if fails to parse)
    match TryAddOp with  // peek ahead in token stream to try parse
    | None -> mulTerm    // next token was not prefix for addOp rule, stop here
    | Some(ao) ->        // did parse an addOp
         let rhsMulTerm = MulTerm()
         match ao with
         | XPlus -> Add(mulTerm, rhsMulTerm)
         | XMinus -> Sub(mulTerm, rhsMulTerm)
and TryAddOp() =
    let next = tokens.Peek() 
    match next with
    | XPlus | XMinus ->
        tokens.ConsumeNext()
        Some(next)
    | _ -> None
...

Hopefully you see the basic idea. This assumes a global mutable token stream that allows both 'peek at next token' and 'consume next token'.

Brian
+1. Also make sure the grammar is *not* left-recursive.
Mehrdad Afshari
A: 

If I remember from college classes the idea was to build expression trees like:

<program> --> <expression> <op> <expression> | <expression>
<expression> --> (<expression>) | <constant>
<op> --> * | - | + | /
<constant> --> <constant><constant> | [0-9]

then once you have construction your tree completely so you get something like:

            exp
       exp   op     exp
    5        +       and so on

then you run your completed tree through another program that recursively descents into the tree calculating expressions until you have an answer. If your parser doesn't understand the tree, you have a syntax error. Hope that helps.

Chad