views:

250

answers:

2

I'm creating a compiler with Lex and YACC (actually GNU Flex and Bison). The language allows unlimited forward references to any symbol (like C#). The problem is that it's impossible to parse the language without knowing what an identifier is.

The only solution I know of is to lex the entire source, and then do a "breadth-first" parse, so higher level things like class declarations and function declarations get parsed before the functions that use them. However, this would take a large amount of memory for big files, and it would be hard to handle with YACC (I would have to create separate grammars for each type of declaration/body). I would also have to hand-write the lexer (which is not that much of a problem).

I don't care a whole lot about efficiency (although it still is important), because I'm going to rewrite the compiler in itself once I finish it, but I want that version to be fast (so if there are any fast general techniques that can't be done in Lex/YACC but can be done by hand, please suggest them also). So right now, ease of development is the most important factor.

Are there any good solutions to this problem? How is this usually done in compilers for languages like C# or Java?

+6  A: 

It's entirely possible to parse it. Although there is an ambiguity between identifiers and keywords, lex will happily cope with that by giving the keywords priority.

I don't see what other problems there are. You don't need to determine if identifiers are valid during the parsing stage. You are constructing either a parse tree or an abstract syntax tree (the difference is subtle, but irrelevant for the purposes of this discussion) as you parse. After that you build your nested symbol table structures by performing a pass over the AST you generated during the parse. Then you do another pass over the AST to check that identifiers used are valid. Follow this with one or more additional parses over the AST to generate the output code, or some other intermediate datastructure and you're done!

EDIT: If you want to see how it's done, check the source code for the Mono C# compiler. This is actually written in C# rather than C or C++, but it does use .NET port of Jay which is very similar to yacc.

U62
It has nothing to do with keywords. It more like this: is A.B.C (package A.B).(class C), (package A).(class B).(field C), or (fieled A).(field B).(field C), etc.
Zifre
Then the second paragraph of my answer applies. You don't need to know that to parse. Treat '.' as an operator in your grammar. In your AST passes you can then check them against the symbol table.
U62
Well, I guess I'm going to have to just make a parse tree rather than an AST. As you said they are different. If nobody else comes up with a better answer I'll accept this, but I'd really rather not do it this way...
Zifre
+1  A: 

One option is to deal with forward references by just scanning and caching tokens till you hit something you know how to real with (sort of like "panic-mode" error recovery). Once you have run thought the full file, go back and try to re parse the bits that didn't parse before.

As to having to hand write the lexer; don't, use lex to generate a normal parser and just read from it via a hand written shim that lets you go back and feed the parser from a cache as well as what lex makes.

As to making several grammars, a little fun with a preprocessor on the yacc file and you should be able to make them all out of the same original source

BCS
I'm not really worried about hand writing the lexer, it's not that hard (it might actually be slightly easier since my language has Python-like indentation). Using the preprocessor with YACC sounds like it might work, but is there any way to change the start symbol?
Zifre
Re a preprocessor with yacc, that's exactly the idea. define the full grammar without explicitly defining the start symbol and then swap out a small bit of the file (via something like #include or #define) to pick the starting point. One way to do that would be to Have the start rule of the form "Root ::= MacroRule;" and replace MacroRule with whatever you want for this version.
BCS