views:

105

answers:

1

I'm trying to lex (then parse) a C like language. In C there are preprocessor directives where line breaks are significant, then the actual code where they are just whitespace.

One way of doing this would be do a two pass process like early C compilers - have a separate preprocessor for the # directives, then lex the output of that.

However, I wondered if it was possible to do it in a single lexer. I'm pretty happy with writing the scala parser-combinator code, but I'm not so sure of how StdLexical handles whitespace.

Could someone write some simple sample code which say could lex a #include line (using the newline) and some trivial code (ignoring the newline)? Or is this not possible, and it is better to go with the 2-pass appproach?

+5  A: 

OK, solved this myself, answer here for posterity.

In StdLexical you already have the ability to specify whitespace in your lexer. All you have to do is override your token method appropriately. Here is some sample code (with non relevant bits removed)

override def token: CeeLexer.Parser[Token] = controlLine 
  // | ... (where ... is whatever you want to keep of the original method)
def controlLine = hashInclude

def hashInclude : CeeLexer.Parser[HashInclude] =
  ('#' ~ word("include") ~ rep(nonEolws)~'\"' ~ rep(chrExcept('\"', '\n', EofCh)) ~ '\"' ~ '\n' |
   '#' ~ word("include") ~ rep(nonEolws)~'<' ~ rep(chrExcept('>', '\n', EofCh)) ~ '>' ~ '\n' ) ^^ {
   case hash~include~whs~openQ~fname~closeQ~eol =>  // code to handle #include
 }
Nick Fortescue