views:

392

answers:

2

I'm trying to build up some skills in lexing/parsing grammars. I'm looking back on a simple parser I wrote for SQL, and I'm not altogether happy with it -- it seems like there should have been an easier way to write the parser.

SQL tripped me up because it has a lot of optional tokens and repitition. For example:

SELECT *
FROM t1
INNER JOIN t2
INNER JOIN t3
WHERE t1.ID = t2.ID and t1.ID = t3.ID

Is equivalent to:

SELECT *
FROM t1
INNER JOIN t2 ON t1.ID = t2.ID
INNER JOIN t3 on t1.ID = t3.ID

The ON and WHERE clauses are optional and can occur more than once. I handled these in my parser as follows:

%{
open AST
%}

// ...    
%token <string> ID   
%token AND OR COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
// ...

%%

op: EQ { Eq } | LT { Lt } | LE { Le } | GT { Gt } | GE { Ge }

// WHERE clause is optional    
whereClause:
    |                       { None }
    | WHERE whereList       { Some($2) }        

whereList:
    | ID op ID                    { Cond($1, $2, $3) }
    | ID op ID AND whereList      { And(Cond($1, $2, $3), $5) }
    | ID op ID OR whereList       { Or(Cond($1, $2, $3), $5) }


// Join clause is optional    
joinList:
    |                       { [] }
    | joinClause            { [$1] }
    | joinClause joinList   { $1 :: $2 }

joinClause:
    | INNER JOIN ID joinOnClause    { $3, Inner, $4 }
    | LEFT JOIN ID joinOnClause     { $3, Left, $4 }
    | RIGHT JOIN ID joinOnClause    { $3, Right, $4 }

// "On" clause is optional    
joinOnClause:
    |                       { None }
    | ON whereList          { Some($2) }

// ...
%%

In other words, I handled optional syntax by breaking it into seperate rules, and handled repitition using recusion. This works, but it breaks parsing into a bunch of little subroutines, its very hard to see what the grammar actually represents.

I think it would be much easier to write if I could specify optional syntax inside brackets and repitition with a * or +. This would reduce my whereClause and joinList rules to the following:

whereClause:
    |                       { None }
//    $1    $2, I envision $2 being return as an (ID * op * ID * cond) list
    | WHERE [ ID op ID { (AND | OR) }]+
        { Some([for name1, op, name2, _ in $1 -> name1, op, name2]) }

joinClause:
    |                       { None }

//    $1, returned as (joinType
//                       * JOIN
//                       * ID
//                       * ((ID * op * ID) list) option) list
    | [ (INNER | LEFT | RIGHT) JOIN ID [ON whereClause] ]*
        { let joinType, _, table, onClause = $1;
          Some(table, joinType, onClause) }

I think this form is much easier to read and expresses the grammar its trying to capture more intuitively. Unfortunately, I can't find anything in either the Ocaml or F# documentation which supports this notation or anything similar.

Is there an easy way to represent grammars with optional or repetitive tokens in OcamlYacc or FsYacc?

+2  A: 

When you compose all the little pieces you should get something like you want though. Instead of:

(INNER | RIGHT | LEFT )

you just have

inner_right_left

and define that to be the union of those three keywords.

You can also define the union in the lexer. in the way you define the tokens, or using camlp4, I haven't done much in that area, so I cannot advise you to take those routes. And I don't think they'll work for you as well as just having little pieces everywhere.

EDIT:

So, for camlp4 you can look at Camlp4 Grammar module and a tutorial and a better tutorial. This isn't exactly what you want, mind you, but it's pretty close. The documentation is pretty bad, as expressed in the recent discussion on the ocaml groups, but for this specific area, I don't think you'll have too many problems. I did a little with it and can field more questions.

nlucaroni
+1  A: 

Menhir allows to parametrize nonterminal symbols by another symbols and provides the library of standard shortcuts, like optionals and lists, and you can create your own. Example:

option(X): x=X { Some x} 
         | { None }

There is also some syntax sugar, 'token?' is equivalent to 'option(token)', 'token+' to 'nonempty_list(token)'.

All of this really shortens grammar definition. Also it is supported by ocamlbuild and can be a drop-in replacement for ocamlyacc. Highly recommended!

Funny, I used it to parse SQL too :)

ygrek