I'm working on a tool that will perform some simple transformations on programs (like extract method). To do this, I will have to perform the first few steps of compilation (tokenizing, parsing and possibly building a symbol table). I'm going to start with C and then hopefully extend this out to support multiple languages.

My question is, what's the best way to perform these steps that:

1.) Doesn't reinvent the wheel. Clearly I don't want to write Flex/Bison specifications by hand. Do I just grab pre-existing specifications and work from there? Is ANTLR the way to go here?

2.) Is extendable to multiple languages. Obviously lexing/parsing will be different for everybody, but I would like a solution that I could easily extend to other languages. At least a set of technologies that would make this manageable.

BTW, I'm using C to write my applications.

If anyone has any ideas that would be great! Thanks!

+1  A: 

You didn't specify a language, so I'll just recommend this little gem I found the other day:

http://irony.codeplex.com/

It's super simple to use, and even has grammars pre-built for several languages (C# even). There's also pyparsing (http://pyparsing.wikispaces.com/) if you want to use Python as your implementation language.

Timothy Baldridge
Ha, sorry about that. I'm using C.
ChrisDiRulli
+3  A: 

Hands down, the best way to do any parsing is ANTLR. There are two great books on the subject by its author that are must-haves: The Definitive ANTLR Reference: Building Domain-Specific Languages, and Language Implementation Patterns. Both are invaluable resources. ANTLR can generate processing code in lots of different languages.

fuzzy lollipop
A: 

One door to go through is Eclipse. It has parsing, including error-tolerant parsing, for a variety of languages, and its internal modularity allows you to exploit this functionality without touching the IDE itself.

bmargulies
That seems like an overly complex way to go about it. Use an IDE to build a utility?
Timothy Baldridge
I don't know the Eclipse code base, but if you can extract the parsing code easily and already code in Java, this might in fact be a great idea. If you are coding in C, you could also have a look at the GCC front ends for different languages, since the GCC folks have already put a lot of effort into generating a common representation from the parsers for all kinds of languages, so that they can use the same code-generation routines for each of them. But be warned: the code base is massive, and it might be easier to just roll your own parser.
ahe
+1  A: 

Since you are going to use already-written grammars and regular expressions, your choice of tool doesn't matter much.

You can go with Flex/Bison, and you will find many grammars already written for them. Otherwise you can go with ANTLR, which should work for C, C++, and Java without problems, and do the same thing with it.

You didn't say which language you are going to use for this work, so suggesting the best approach is not so easy.

Keep in mind that every language has its own features: for example, symbol tables are constructed differently in Ruby than in C++, because declarations can be stricter or looser from one language to the next. So you should think carefully about what you are going to need (and you can explain it in your question too, so I can give better help). A rough sketch of the structure involved follows.
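
To give a rough idea, here is a minimal sketch in C of a scoped symbol table. All the names here are made up for illustration, and a real implementation would probably use a hash table per scope instead of linked lists:

    #include <string.h>

    /* One entry per declared name; a real tool would also record
       type information, storage class, and so on. */
    typedef struct Symbol {
        const char    *name;
        struct Symbol *next;    /* next entry in the same scope */
    } Symbol;

    /* Scopes form a stack: each block pushes a scope whose parent
       is the enclosing one. */
    typedef struct Scope {
        Symbol       *symbols;
        struct Scope *parent;   /* NULL at file scope */
    } Scope;

    /* Resolve a name by walking outward through enclosing scopes,
       the way lexical scoping works in C. */
    Symbol *scope_lookup(Scope *s, const char *name) {
        for (; s != NULL; s = s->parent)
            for (Symbol *sym = s->symbols; sym != NULL; sym = sym->next)
                if (strcmp(sym->name, name) == 0)
                    return sym;
        return NULL;
    }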

Of your two phases, I can say that:

  • Tokenizing is quite simple: it doesn't require different structures for every language and can easily be extended to support a plethora of programming languages (see the token sketch after this list).

  • Parsing can be more difficult. You have to build up an Abstract Syntax Tree (AST) of the program and then do whatever you want with it. If you do it OOP-style you'll have a class for every node type, but node types differ between languages because languages are structurally different, so designing something general and easily extendable to other languages is quite tricky (see the AST sketch after this list).
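
To make the first point concrete, here is a minimal sketch in C of a language-agnostic token representation. The type and field names are illustrative assumptions, not taken from any particular tool:

    #include <stddef.h>

    /* Token kinds are deliberately coarse so the same struct can serve
       several source languages; only the keyword and punctuator tables
       need to change per language. */
    typedef enum {
        TOK_IDENTIFIER,
        TOK_KEYWORD,
        TOK_NUMBER,
        TOK_STRING,
        TOK_PUNCTUATOR,
        TOK_EOF
    } TokenKind;

    typedef struct {
        TokenKind   kind;
        const char *start;    /* points into the source buffer */
        size_t      length;   /* lexeme length */
        int         line;     /* for error reporting */
    } Token;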
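
And for the second point: the usual C substitute for "a class for every node type" is a tagged union. Again a minimal sketch with made-up node kinds; a full C grammar needs many more:

    /* One struct covers all node types; 'kind' says which member of
       the union is active. Supporting another language means adding
       kinds, which is where "easily extendable" gets tricky. */
    typedef struct AstNode AstNode;

    typedef enum {
        NODE_NUMBER,
        NODE_IDENT,
        NODE_BINARY,
        NODE_FUNC_DEF
    } NodeKind;

    struct AstNode {
        NodeKind kind;
        union {
            struct { double value; } number;
            struct { const char *name; } ident;
            struct { char op; AstNode *lhs, *rhs; } binary;
            struct { const char *name; AstNode *body; } func;
        } as;
    };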

On the parsing point, ANTLR wins over Flex and Bison because it offers automatic generation of the AST (if I remember correctly).

The main difference between these two compiler-compilers is that ANTLR uses an LL(k) parser (top-down), while Bison uses an LALR(1) parser (bottom-up). If you use already-written grammars, though, that difference shouldn't matter much.
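
The practical consequence of that difference is left recursion. As an illustration (a textbook example, not taken from any of the grammars mentioned above), an LALR tool like Bison accepts a left-recursive rule directly, while an LL tool like ANTLR 3 needs it rewritten into iteration:

    /* Bison (LALR, bottom-up): left recursion is fine. */
    expr : expr '+' term
         | term
         ;

    /* ANTLR 3 (LL, top-down): the same rule has to be rewritten,
       typically using EBNF iteration. */
    expr : term ('+' term)* ;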

A piece of personal advice: I have written many interpreters and compilers, but never started from a fully-featured language. C's syntax is really big, so maybe you should start from a subset, see what you can do with tokens and the AST, and later extend it to support the full syntax.

Jack
Unfortunately ANTLR doesn't generate ASTs automatically. You have to write rewrite rules for that; otherwise you get a flat list of tokens as the AST. Apart from that, you really want to write those rules anyway, since for any non-trivial language you want abstract nodes in your AST. To give an example: if you parse a fully qualified Java class name like 'java.util.List', you don't want to just get the ID nodes 'java', 'util', 'List' and two nodes for the DOT tokens in between; you want an abstract 'CLASSNAME' node above them, so that just from looking at your AST you know what those tokens mean (see the sketch below).
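
For what that looks like in practice, here is a minimal sketch of ANTLR 3's tree-rewrite syntax for exactly this case. The rule and token names are made up for illustration; CLASSNAME is an imaginary token declared only to label the tree node:

    tokens { CLASSNAME; }    // imaginary token, used only as an AST label

    // Without the '->' rewrite you get the flat token list described
    // above; with it, the IDs become children of a single CLASSNAME
    // node and the DOT tokens are dropped.
    classname
        : ID (DOT ID)* -> ^(CLASSNAME ID+)
        ;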
ahe
+2  A: 

What language do you write your program in?

I'd go with ANTLR (and I actually use it for parsing Java). It supports a lot of target languages and also has a lot of example grammars that you get for free: http://www.antlr.org/grammar/list. Unfortunately they aren't guaranteed to be perfect (the Java grammar has no AST rules), but they give you a good start, and I suppose the community is quite big for a parser generator.

The great thing about ANTLR, apart from the many language targets, is that LL(*) combined with the predicates ANTLR supports is very powerful and easy to understand, and the generated parsers are readable too (see the predicate sketch below).
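
To make the predicate point concrete, here is a minimal sketch of an ANTLR 3 syntactic predicate, using the classic C ambiguity where '(' can start either a cast or a parenthesized expression. The rule names are illustrative assumptions, not from a real grammar:

    /* The (...)=> predicate tells the generated parser: try the cast
       interpretation first, and fall back to the next alternative if
       that lookahead fails. */
    unary_expr
        : ( '(' type_name ')' ) => '(' type_name ')' unary_expr   // cast
        | '(' expression ')'                                      // grouping
        | ID
        ;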

With "extendable to multiple languages" I suppose you mean multiple source languages. This isn't easy but I suppose you might have some success when translating them to ASTs that have as much common symbols as possible and writing a general tree walker that can handle the differences in those languages. But this could be quite difficult.

Be warned, though, that the online documentation is only good once you've read the official ANTLR book and understand LL(*) and semantic and syntactic predicates.

ahe