My day job involves developing a compiler for a Pascal-like language. So far I've been working on optimizations and code generation.

I would also like to start learning how to build a simple parser for the same language. I'm not really sure how to go about this, however. Flex and Bison seem to be the usual choice, but isn't it possible to write a parser using C++ or C#? I'm not very comfortable with C.

Yacc++ supports C#, but it's commercially licensed. I'm looking for all the help I can find in this regard. Suggestions would be highly appreciated.

A: 

If you were writing it in Java I'd recommend ANTLR. It's a nice LL(*) parser-generator written in Java. There's a terrific book for it on Amazon, too.

duffymo
I would love to use Java. But, I'm actually already in love with C++ and C#. However, ANTLR seems to be the way to go. And since this one's a hobby project, I'd consider building it in Java. Thanks. Any more suggestions before I start?
ANTLR can generate code for many targets: http://www.antlr.org/wiki/display/ANTLR3/Code+Generation+Targets
gimel
A: 

flex & bison are the canonical lexer and parser generators. If you're interested in C++, I've found Boost Spirit useful. I've never used it for anything as complex as a compiler, though. I'm sure others will have interesting suggestions for other languages such as C#...

Jon
Thanks, I'll check that out too.
A: 

You can actually use flex & bison with C++. In this tutorial, for example, you can see that section 5 is dedicated to that matter. Just google for it, and I'm sure you will find lots of examples.

Marc
+5  A: 

I believe you can use ANTLR with C#. I've never tried it myself (yet), however there is a tutorial here that might point you in the right direction.

KevB
Never used ANTLR myself, but heard good things about it. It's probably what I'd try in your shoes, assuming I can find predone grammars for Pascal for it, like you can for lex+yacc.
T.E.D.
A: 

When you use Lex and Yacc you don't actually write much of anything in C. Lex is its own language, as is Yacc. So you write the lexical analyzer in Lex and the parser in Yacc. However, for Pascal, Lex and Yacc inputs are already available.

The resulting parser and lexer have C interfaces, that is true. However most languages, including C++, have simple ways to call (or wrap) C interfaces.

I'm not an expert in it, but I'm sure all of the above goes for ANTLR as well.

If you are aching to do it in "pure C++" (whatever that means), look into using Boost Spirit. I don't really see the point in theoretical purity if it causes a ton more work, though.

Writing your own lexers and parsers by hand is actually kinda fun. A lexer is one of the very few situations where you can justify using both gotos and the preprocessor. However, I wouldn't suggest it for a full-blown language like Pascal if you can avoid it. That would be a lot of work. I'm talking man-years.

T.E.D.
+1  A: 

Personally, I roll my own lexer and parser (LL). Here's a very abbreviated example. It is in C++, but hopefully you can adapt it. It uses a macro, PARSE_HIGHER, to make it easy to insert operators at different precedence levels without changing much code.

 // routine to scan over whitespace/comments
void ScanWhite(const char* &pc){
  while(true){
    if(0);
    else if (WHITESPACE(*pc)) pc++;
    else if (pc[0]=='/' && pc[1]=='/'){
      while(*pc && *pc++ != '\n');
    }
    else break;
  }
}
// routine to lex an identifier
bool SeeId(const char* &pc, char* sId){ // fills the caller's buffer
  ScanWhite(pc);
  const char* pc0 = pc;
  if (alpha(*pc)){
    while(alphanum(*pc)) *sId++ = *pc++;
    *sId = '\0';
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex a number
bool SeeNum(const char* &pc, double &dNum){
  ScanWhite(pc);
  const char* pc0 = pc;
  if (digit(*pc)){
    dNum = 0;
    while(digit(*pc)) dNum = dNum * 10 + (*pc++ - '0');
    if (*pc == '.'){
      pc++; // consume the decimal point
      double divisor = 1, frac = 0;
      while(digit(*pc)){
        divisor *= 0.1;
        frac += (*pc++ - '0') * divisor;
      }
      dNum += frac;
    }
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex some constant word
bool SeeWord(const char* &pc, const char* sWord){
  ScanWhite(pc);
  const char* pc0 = pc;
  int len = strlen(sWord);
  if (strncmp(pc, sWord, len)==0 && !alphanum(pc[len])){
    pc += len;
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex a single character like an operator
bool SeeChar(const char* &pc, const char c){
  ScanWhite(pc);
  const char* pc0 = pc;
  if (*pc == c){
    pc++;
    return true;
  }
  pc = pc0;
  return false;
}
// primitive expression parser
void ParsePrimitiveExpr(const char* &pc, CNode* &p){
  double dNum;
  char sId[100];
  if (0);
  else if (SeeNum(pc, dNum)){
    p = new CNode(dNum);
  }
  else if (SeeId(pc, sId)){
    // see if it's a function call
    if (SeeChar(pc, '(')){
      p = MakeNewFunctionCallNode(sId);
      while(!SeeChar(pc, ')')){
        CNode* p1 = null;
        ParseExpression(pc, p1);
        AddArgumentExprToFunctionCallNode(p, p1);
        SeeChar(pc, ','); /* optional comma separator */
      }
    }
    // otherwise it's just a variable reference
    else {
      p = new CNode(sId);
    }
  }
  // handle embedded expressions
  else if (SeeChar(pc, '(')){
    ParseExpression(pc, p);
    if (!SeeChar(pc, ')')){ /* deal with syntax error */ }
  }
}
#define PARSE_HIGHER ParsePrimitiveExpr
// product parser
void ParseProduct(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
  while(true){
    if (0);
    else if (SeeChar(pc, '*')){
      CNode* p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('*', p, p1);
    }
    else if (SeeChar(pc, '/')){
      CNode* p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('/', p, p1);
    }
    else break;
  }
}
#undef  PARSE_HIGHER
#define PARSE_HIGHER ParseProduct
// sum parser
void ParseSum(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
  while(true){
    if (0);
    else if (SeeChar(pc, '+')){
      CNode* p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('+', p, p1);
    }
    else if (SeeChar(pc, '-')){
      CNode* p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('-', p, p1);
    }
    else break;
  }
}
#undef  PARSE_HIGHER
// can insert more routines like the above
// to handle more operators
#define PARSE_HIGHER ParseSum
// overall expression parser
void ParseExpression(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
}

Added some Pascal-style statement syntax:

void ParseStatement(const char* &pc){
  char sId[100];
  if(0);
  else if (SeeWord(pc, "begin")){
    while(!SeeWord(pc, "end")){
      ParseStatement(pc);
      SeeChar(pc, ';');
    }
  }
  else if (SeeWord(pc, "while")){
    CNode* p1 = null;
    ParseExpression(pc, p1);
    ParseStatement(pc);
    /* semantics for while statement */
  }
  else if (SeeWord(pc, "if")){
    CNode* p1 = null;
    ParseExpression(pc, p1);
    ParseStatement(pc);
    if (SeeWord(pc, "else")){
      ParseStatement(pc);
    }
    /* semantics for if statement */
  }
  else if (SeeWord(pc, "for")){
    /* you do it */
  }
  // handle assignments and subroutine calls
  else if (SeeId(pc, sId)){
    if(0);
    else if (SeeChar(pc, '=')){ /* real Pascal spells assignment ":=" */
      CNode* p1 = null;
      ParseExpression(pc, p1);
      /* semantics for assignment statement */
    }
    else if (SeeChar(pc, '(')){
      CNode* p = MakeNewFunctionCallNode(sId);
      while(!SeeChar(pc, ')')){
        CNode* p1 = null;
        ParseExpression(pc, p1);
        AddArgumentExprToFunctionCallNode(p, p1);
        SeeChar(pc, ','); /* optional comma separator */
      }
    }
    else {
      /* we have a 1-word statement, which is OK in Pascal */
    }
  }
  else {
    /* syntax error */
  }
}

It still needs syntax for array indexing, variable declaration, and function definition, but I hope it is clear how to do that.

Mike Dunlavey
This type of parser is ok until the language gets more complex than simple expressions. Start getting into loops, if's, for's, functions and what not and you'll get into some hairy code. Maybe if it's done right it's ok, but then again what is "right."
Well, as I said, it is a very abbreviated example. It has no problem handling any LL(1) grammar. This does the expression syntax; if you like, I'll throw in some statement syntax. That's where the SeeWord lexer gets used.
Mike Dunlavey
OK, I added some statement syntax. I hope you can see how to take it from there.
Mike Dunlavey
Point being, that when it comes to languages that go beyond simple procedural programming, it's probably easier and more maintainable to use tools like LEX and YACC.
YACC is necessary for LR (shift-reduce) languages, specifically those that have complex type-declaration syntax, such as C, C++, Java, C#. If you're doing Pascal or your own language, and you can't find a tool you like, the above technique can do any LL parser, provably so.
Mike Dunlavey
As far as maintainability, rules in a LL parser generator map 1-1 onto recursive routines (as above), and in fact that's what they can generate. Beyond that, saying it's maintainable is just saying it's personally preferable. But thanks for your input.
Mike Dunlavey
A: 

I've written an XSLT parser with flex and bison. More lately I'm doing a project using ANTLR, though:

is JFig language syntax efficient and clear (and better than Spring-Framework’s XML DSL)?

I've liked working in ANTLR much more so than Flex and Bison. ANTLR puts you up at a higher level of abstraction in some respects. The lexical definitions and parser grammar can all go in one file. (ANTLR will generate the token file.)

One of the key items is the ability to define tree grammars. Basically, you do a grammar parse over the input language and use actions that rewrite the input into a compact AST (which remains as a linked data structure in memory). You can then pass this tree to another parser defined in a separate tree-parser file. The tree parser is where you do the real work of the actions you want.

This is a nice approach because you can keep the AST form and repeatedly reprocess it as needed, peeling off specific subtree nodes to process based on later actions, etc. Think of something like a language interpreter: instead of re-parsing the source text from the ground up over and over again, you can just process its AST representation.

In my case I've devised a bean factory for doing IoC dependency injection. My bean factory keeps the AST of a bean descriptor around at runtime. When it needs to make (or retrieve) a new bean instance, I just pass the bean descriptor AST subtree to my tree parser - the outcome is the desired bean instance (there are a lot of factors that go in to determining how to instantiate the requested bean, including making any other beans that are referenced and/or applying other special behaviors via meta attributes).

Finally, my current bean factory is targeting Java, but I want to target ActionScript3 and C# .NET. ANTLR has support for doing that.

As mentioned, Terence Parr has written a book on how to use ANTLR. It is aimed at working programmers who need to do something practical with ANTLR (as opposed to an academic treatment of the subject).

RogerV
+1  A: 

In his classic programming text, Algorithms + Data Structures = Programs, Niklaus Wirth develops an entire recursive descent parser (in Pascal) for PL/0, a simple Pascal-like language.

joel.neely
A: 

If you're wanting C#, as per this question, try Gardens Point GPPG and GPLEX.

Simeon Pilgrim