tags:
views: 2093
answers: 16

We've used ANTLR to create a parser for a SQL-like grammar, and while the results are satisfactory in most cases, there are a few edge cases that we need to fix. Since we didn't write the parser ourselves, we don't understand it well enough to make sensible changes.

So, we'd like to write our own parser. What's the best way to go about writing a parser by hand? What sort of parser should we use? Recursive descent has been recommended - is that right? We'll be writing it in C#, so any tutorials for writing parsers in that language would be gratefully received.

UPDATE: I'd also be interested in answers that involve F# - I've been looking for a reason to use that in a project.

+1  A: 

In C/Unix, the traditional way is to use lex and yacc. With GNU, the equivalent tools are flex and bison. I don't know for Windows/C#.

mouviciel
I'd like to write it by hand - without using a parser generator.
Simon
flex, while GPLed, is not a GNU project.
Eric
+6  A: 

Recursive descent will give you the simplest way to go, but I would have to agree with mouviciel that flex and bison are definitely worth learning. When you find out you have a mistake in your grammar, fixing the definition of the language in flex/bison will be a hell of a lot easier than rewriting your recursive descent code.

FYI, the C# parser is written as a recursive descent parser, and it tends to be quite robust.

Spence
"A hell of a lot easier" is an exaggeration. The methods in a recursive descent parser are the grammar (and it's simple once this is recognized) ...
mike g
Depends on where your mistake is. With recursive descent you may semantically change the meaning of an entire section of the grammar. In a flex/bison definition this would be a one-line change; in recursive descent you may have to rewrite half your code to deal with it.
Spence
A: 

Well, if you don't mind using another compiler-compiler tool like ANTLR, I suggest you take a look at Coco/R.

I've used it in the past and it was pretty good...

bruno conde
No, we're pretty much decided on writing one by hand. Thanks, though.
Simon
+5  A: 

The only kind of parser that can be handwritten by a sane human being is recursive descent. That said, writing a bottom-up parser by hand is still possible, but it is very undesirable.

If you're up for an RD parser, you have to verify that your SQL grammar is not left-recursive (and eliminate the recursion if necessary), and then basically write a function for each grammar rule. See this for further reference.
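To make the "function per rule" idea concrete, here is a minimal sketch of my own (not from the answer) - in Python for brevity, though the structure translates directly to C#. It shows the usual trick of eliminating left recursion by rewriting `Expr -> Expr '+' Term | Term` as `Expr -> Term ('+' Term)*` and turning that into a loop:

```python
# Hypothetical sketch: one function per grammar rule, with the
# left-recursive rule  Expr -> Expr '+' Term | Term  rewritten as
# Expr -> Term ('+' Term)*  so it can be parsed by a simple loop.

def parse_expr(tokens):
    """tokens is a list of strings; returns (ast, remaining_tokens)."""
    node, tokens = parse_term(tokens)
    while tokens and tokens[0] == '+':
        right, tokens = parse_term(tokens[1:])  # consume the '+'
        node = ('+', node, right)               # builds a left-associative tree
    return node, tokens

def parse_term(tokens):
    # Terminal rule: a term is just an integer literal here.
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("expected number, got %r" % tokens[:1])
    return int(tokens[0]), tokens[1:]
```

Note that the loop preserves left associativity, which is exactly what the naive left-recursive rule would have given you.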

Anton Gogolev
+3  A: 

If you want to write it by hand, recursive descent is the most sensible way to go.

You could use a table-driven parser, but that will be extremely hard to maintain.

Example:

Data = Object | Value;
Value = Ident, '=', Literal;
Object = '{', DataList, '}';
DataList = Data | DataList, Data;


ParseData {
  if PeekToken = '{' then 
    return ParseObject;
  if PeekToken = Ident then
    return ParseValue;
  return Error;
}

ParseValue {
  ident = TokenValue;
  if NextToken <> '=' then
    return Error;
  if NextToken <> Literal then
    return Error;
  return (ident, TokenValue);
}

ParseObject {
  AssertToken('{');
  temp = ParseDataList;
  AssertToken('}');
  return temp;
}

ParseDataList {
  data = ParseData;
  temp = [];
  while Ok(data) {
    temp = temp + data;
    data = ParseData;
  }
  return temp;
}
Gamecat
+1  A: 

If I were you I would have another go at ANTLR v3 using the ANTLRWorks GUI, which gives you a very convenient way of testing your grammar. We use ANTLR in our project, and although the learning curve may be a bit steep in the beginning, once you learn it it is quite convenient. Also, there are a lot of very helpful people on their mailing list.

PS. IIRC they also have a SQL grammar you could take a look at.

hth

Anders K.
Thanks but I'd like to write it by hand.
Simon
+3  A: 

Recursive descent parsers are indeed the best, maybe only, parsers that can be built by hand. You will still have to bone up on what exactly a formal, context-free language is and put your language into a normal form. I would personally suggest that you remove left recursion and put your language into Greibach Normal Form. When you do that, the parser just about writes itself.

For example, this production:

A => aC 
A => bD
A => eF

becomes something simple like:

void A() {
   chr = read();
   switch (chr) {
     case 'a': C(); break;
     case 'b': D(); break;
     case 'e': F(); break;
     default: throw_syntax_error('A', chr);
   }
}

And there aren't any much more difficult cases than this (what's more difficult is making sure your grammar is in exactly the right form, but this gives you the control that you mentioned).

Anton's link also seems excellent.

Joe Soul-bringer
+1 for Greibach Normal Form - takes me back to Uni days
pjp
+3  A: 

Adding my voice to the chorus in favor of recursive descent (LL(1)). Such parsers are simple, fast, and, IMO, not at all hard to maintain.

However, take a good look at your language to make sure it is LL(1). If you have any syntax like C has, such as ((((type))foo)[ ]), where you might have to descend multiple layers of parentheses before you even find out whether you are looking at a type, variable, or expression, then LL(1) will be very difficult, and bottom-up wins.

Mike Dunlavey
And the right thing to do is to change the language to LL(1).
Stephan Eggermont
@Stephan: I'm inclined to agree, though on one or two occasions I've had no choice - I had to parse C.
Mike Dunlavey
Or you just have logic in your parser: if (type) ... else if (variable) ...
mike g
A: 

The reason you don't want a table-driven parser is that you will not be able to create sensible error messages. That is OK for a generated language, but not for one where humans are involved. The error messages produced by C-like language compilers provide ample evidence that people can adapt to anything, no matter how bad.

Stephan Eggermont
+2  A: 

I suggest that you don't write the lexer by hand - use flex or similar. The task of recognising tokens is not that hard to do by hand, but I don't think you'd gain much.

As others have said, recursive descent parsers are easiest to write by hand. Otherwise you have to maintain the table of state transitions for each token, which isn't really human-readable.
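To illustrate that point, here is a hypothetical toy comparison of my own (not from the answer): a table-driven LL(1) parser for the grammar `S -> '(' S ')' | 'x'` next to the recursive function it replaces. Both work, but the grammar is far harder to read back out of the table version:

```python
# Toy table-driven LL(1) parser vs. the equivalent recursive function,
# for the grammar  S -> '(' S ')' | 'x'

TABLE = {('S', '('): ['(', 'S', ')'],   # production to push, keyed by
         ('S', 'x'): ['x']}             # (nonterminal, lookahead token)

def table_parse(s):
    stack, i = ['S'], 0
    while stack:
        top = stack.pop()
        if top == 'S':
            # Expand the nonterminal; a missing key is a syntax error.
            stack.extend(reversed(TABLE[('S', s[i])]))
        elif s[i] == top:
            i += 1                      # match a terminal
        else:
            raise SyntaxError("expected %r at %d" % (top, i))
    return i == len(s)

def recursive_parse(s, i=0):
    # The same grammar as one readable function; returns the end index.
    if s[i] == 'x':
        return i + 1
    if s[i] == '(':
        i = recursive_parse(s, i + 1)
        if s[i] != ')':
            raise SyntaxError("expected ')'")
        return i + 1
    raise SyntaxError("expected '(' or 'x'")
```

Even at this size, the recursive version reads like the grammar, while the table version requires you to mentally replay the stack machine.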

I'm pretty sure ANTLR implements a recursive descent parser anyway: there's a mention of it in an interview about ANTLR 3.0.

I've also found a series of blog posts about writing a parser in C#. It seems quite gentle.

Mike Houston
A: 

I would also vote for using an existing parser + lexer.

The only reasons I can see for doing it by hand:

  • if you want something relatively simple (like validating/parsing some input)
  • to learn/understand the principles.
Mihai Nita
A: 

JFlex is a flex implementation for Java, and there is now a C# port of that project http://sourceforge.net/projects/csflex/. There also appears to be a C# port of CUP in progress, which can be found here: http://sourceforge.net/projects/datagraph/

I too would recommend avoiding hand-crafting your own solution. I tried this once myself for a very simple language (part of a university project), and it was incredibly time-consuming and difficult. It is also exceedingly hard to maintain and change once written.

Using an existing parser generator is the way to go, as the bulk of the hard work has been done and has been well tested over the years.

xan
If your code "looks like" EBNF, then your code is only marginally more difficult to maintain than the EBNF. One should always start with the grammar when crafting a parser, or changing it.
Jason D
A: 

Look at gplex and gppg, lexer and parser generators for .NET. They work well, and are based on the same (and almost compatible) input as lex and yacc, which is relatively easy to use.

leppie
+1  A: 

At the risk of offending the OP, writing a parser for a large language like some specific vendor's SQL by hand when good parser generator tools (such as ANTLR) are available is simply crazy. You'll spend far more time rewriting your parser by hand than you will fixing the "edge cases" using the parser generator, and you'll invariably have to go back and revise the parser anyway as the SQL standards move or you discover you misunderstood something else. If you don't understand your parsing technology well enough, invest the time to understand it. It won't take you months to figure out how to deal with the edge cases with the parser generator, and you've already admitted you are willing to spend months doing it by hand.

Having said that, if you are hell-bent on redoing it manually, using a recursive descent parser is the best way to do it by hand.

(NOTE: OP clarified below that he wants a reference implementation. IMHO you should NEVER code a reference implementation of a grammar procedurally, so recursive descent is out.)

Ira Baxter
No offence taken. Rewriting the parser as standards change isn't a problem here; it's our own SQL-like grammar, and the parser will be the reference implementation.
Simon
If it's a reference implementation then DO NOT code a recursive descent parser. RD parsers are all procedural code, and that's much harder to understand than a specification. You absolutely want a clean, clean specification of your language in clearly defined language rules so that others ("reference", remember?) can clearly see what was specified. ANTLR is far better for this than RD. Bison using the GLR option would be better than ANTLR because you can code context-free grammar rules with impunity.
Ira Baxter
+1 for the point about reference implementations. By using a parser generator you validate the published grammar, and avoid the problem of your (hand-written) parser not actually implementing the grammar.
Stephen C
+8  A: 

I would highly recommend the F# language as your language of choice for parsing on the .NET platform. Its roots in the ML family of languages mean it has excellent support for language-oriented programming.

Discriminated unions and pattern matching allow for a very succinct and powerful specification of your AST. Higher-order functions allow for the definition of parse operations and their composition. First-class support for monadic types allows state management to be handled implicitly, greatly simplifying the composition of parsers. Powerful type inference greatly aids the definition of these (complex) types. And all of this can be specified and executed interactively, allowing you to rapidly prototype.

Stephan Tolksdorf has put this into practice with his parser combinator library FParsec.

From his examples we see how naturally an AST is specified:

type expr =
    | Val of string
    | Int of int
    | Float of float
    | Decr of expr

type stmt =
    | Assign of string * expr
    | While of expr * stmt
    | Seq of stmt list
    | IfThen of expr * stmt
    | IfThenElse of expr * stmt * stmt
    | Print of expr

type prog = Prog of stmt list

The implementation of the parser (partially elided) is just as succinct:

let stmt, stmtRef = createParserForwardedToRef()

let stmtList = sepBy1 stmt (ch ';')

let assign =
    pipe2 id (str ":=" >>. expr) (fun id e -> Assign(id, e))

let print = str "print" >>. expr |>> Print

let pwhile =
    pipe2 (str "while" >>. expr) (str "do" >>. stmt) (fun e s -> While(e, s))

let seq =
    str "begin" >>. stmtList .>> str "end" |>> Seq

let ifthen =
    pipe3 (str "if" >>. expr) (str "then" >>. stmt) (opt (str "else" >>. stmt))
          (fun e s1 optS2 ->
               match optS2 with
               | None    -> IfThen(e, s1)
               | Some s2 -> IfThenElse(e, s1, s2))

do stmtRef:= choice [ifthen; pwhile; seq; print; assign]


let prog =
    ws >>. stmtList .>> eof |>> Prog

On the second line, as an example, stmt and ch are parsers and sepBy1 is a monadic parser combinator that takes two simple parsers and returns a combination parser. In this case sepBy1 p sep returns a parser that parses one or more occurrences of p separated by sep. You can thus see how quickly a powerful parser can be combined from simple parsers. F#'s support for overloaded operators also allows for concise infix notation, e.g. the sequencing combinator and the choice combinator can be specified as >>. and <|>.
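For readers unfamiliar with combinators, here is a hypothetical sepBy1 sketched in Python (FParsec's actual implementation differs; this just illustrates the idea of building a parser out of two simpler parsers). A "parser" here is a function taking (text, pos) and returning (value, new_pos), or None on failure:

```python
# Illustrative parser-combinator sketch (not FParsec's API): a parser is
# a function (text, pos) -> (value, new_pos) or None on failure.

def char(c):
    def p(text, pos):
        if pos < len(text) and text[pos] == c:
            return c, pos + 1
        return None
    return p

def digit(text, pos):
    if pos < len(text) and text[pos].isdigit():
        return int(text[pos]), pos + 1
    return None

def sep_by1(p, sep):
    """One or more occurrences of p, separated by sep."""
    def parser(text, pos):
        first = p(text, pos)
        if first is None:
            return None                  # sepBy1 requires at least one p
        value, pos = first
        values = [value]
        while True:
            s = sep(text, pos)
            if s is None:
                return values, pos
            nxt = p(text, s[1])
            if nxt is None:
                return values, pos       # don't consume a trailing separator
            value, pos = nxt
            values.append(value)
    return parser
```

Usage: `sep_by1(digit, char(';'))` is itself a parser, which is the whole point - combinators compose parsers into bigger parsers.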

Best of luck,

Danny

DannyAsher
+1  A: 

There is no "one best" way. Depending on your needs you may want bottom-up (LALR(1)) or recursive descent (LL(k)). Articles such as this one give personal reasons for preferring LALR(1) (bottom-up) to LL(k). However, each type of parser has its benefits and drawbacks. Generally LALR will be faster, as a finite-state machine is generated as a lookup table.

To pick what's right for you, examine your situation and familiarize yourself with the tools and technologies. Starting with the LALR and LL Wikipedia articles isn't a bad choice. In both cases you should ALWAYS start by specifying the grammar in BNF or EBNF. I prefer EBNF for its succinctness.

Once you've gotten your mind wrapped around what you want to do, and how to represent it as a grammar, (BNF or EBNF) try a couple of different tools and run them across representative samples of text to be parsed.

Anecdotally:

However, I've heard that LL(k) is more flexible. I've never bothered to find out for myself. From my few parser-building experiences I have noticed that regardless of whether it's LALR or LL(k), the best way to pick what's best for your needs is to start with writing the grammar. I've written my own C++ EBNF RD parser builder template library, used Lex/YACC, and hand-coded a small R-D parser. This was spread over the better part of 15 years, and I spent no more than 2 months on the longest of the three projects.

Jason D