Language parsers

tags:

parsing

views:

122

answers:

+1 Q:

I need to parse C#, Ruby and Python source code to generate some reports. I need to get a list of method names inside a class, and I need some other info such as usage of global variable or something. Just parsing using RE could be a solution, but I expect a better (systematic) solution using parsers, if it is easily possible.

What parsers for those languages are provided?

For C#, I found http://csparser.codeplex.com/Wikipage , but for the others, I found a bunch of parsers using those languages, but not the language parsers of them.

+4 A:

It may be worth looking into the ANTLR parser generator.

You'll find, on the ANTLR site, grammars for all 3 languages you are interested in (Although the Ruby grammar is only for a "simplified" version of the language).

The next difficulty may be to adapt these grammars for the particular target language you would like, i.e. the language in which the parsers themselves will be generated.
ANTLR's grammar language is very expressive, allowing one to deal with various context-sensitive languages. This is done by inserting various snippets (in the target language) and/or semantic or syntactic predicates (also in the target language) amid the EBNF-like grammar; consequently the grammar is a bit messier and may need adapting when the target language is changed. The "native" target language of ANTLR is Java, but many other targets languages are supported.

On the whole, ANTLR represents a bit a setup/learning-curve effort, but since you need to deal with 3 languages, it may well be worth the investment, as this will allow you to have a uniform framework (over which you have "full" control), rather than trying to corral three possibly very distinct, and possibly more "locked down" parsers as you started doing.

All three languages are relatively sophisticated languages and although your goal is "merely" to identify methods within programs, you may be able to hack/simplify some of the grammars (or maybe simply "ignore" parts of them), only mapping the few parser-level rules of interest to your eventual goal.
Once these rules are identified, you can apply the same or similar actions, i.e. snippets (in the target language) which implement what you wish to accomplish when the parser encounters such rules (eg: store the method's signature for future reporting, start counting the number of lines... whatever).

A final suggestion:
As hinted in comments to the question, and depending on your goals, you may be able to reuse existing utility programs to perform directly, or indirectly these goals.
Also, because indeed messing with parsers for these sophisticated languages may be somewhat overkill for you possibly simple and possibly error-tolerant goals, the Regular Expressions approach may fit the bill, somehow; the fact of the matter is that none of these languages are regular nor context free, so success with regex will be highly dependent on the eventual goals and on the input data (programs).

Yet another suggestion!
See Larry Lustig's answer! Introspection may simplify much of you task as well. The implication is that you'd need to a) write your logic within each of the the underlying language b) integrate/load the programs to be inspected. All depends, but again, a possible way out from the -let's be fair- relatively heavy investment with formal grammar tools.

mjv 2010-04-15 14:52:16

+2 A:

For Ruby and Python, can't you simply introspect the class to learn the name of the methods? You'd have to write the same functionality in each language but (at least in Python) there's hardly anything to it.

Larry Lustig 2010-04-15 15:29:14

Dang! Great idea. If the OP has the ability to work "within" each of the three languages, a few lines of introspection-based logic in each of these could save him/her from the relatively heavy (but rewarding) investment with formal grammar tools! I responded directly to the quest for _a parser_ but your suggestion is well worth exploring!

mjv 2010-04-15 15:39:52

+1 A:

For Python, the situation is trivial: there is a Python parser in the standard library as well as a more high-level module for manipulating ASTs.

Also, Python has a somewhat simple grammar (at least if you use the trick to keep an indentation stack in your lexer and inject fake BEGIN and END tokens in your token stream, so that you can treat Python as a simple keyword delimited Algol-like language in your parser), so it is often used as an example grammar for parser generators, which means that you can find literally dozens of Python parsers for pretty much every single parser generator, programming language and platform out there. (E.g., here is a Haskell module implementing a Python lexer and parser.)

For Ruby, there are quite a number of parsers available.

Ruby is incredibly hard to parse, so if you need full fidelity, you pretty much have to use the original YACC grammar file from the YARV Ruby implementation. (parse.y in the top-level source directory.) JRuby's parser is derived from that file, and it is the only one of the implementation parsers that has been explicitly designed to also be used by other clients and not just the interpreter itself. (For example, the Eclipse RDT plugin, the Eclipse DLTK/Ruby plugin, the NetBeans Ruby plugin and the jEdit Ruby syntax highlighting all use JRuby's parser.) To facilitate that, JRuby's parser has actually been repackaged as a separate project.

Of course, there are YACC clones for pretty much every language on the planet. However, be aware that YARV does not use a lex generated scanner. It uses a hand-written scanner in C, and also the YACC grammar contains quite a bit of semantic actions in C. Those parts will have to be re-implemented (like they were in JRuby).

The XRuby compiler is the only full Ruby implementation that does not use YARV's parse.y, it uses an ANTLRv3 grammar and an ANTLRv3 tree grammar that have been developed from scratch. ANTLR can generate parsers for a whole bunch of languages, including for example Java and C#. Its Ruby backend, however, is in dire need of some work.

RedParse is a Ruby parser written in Ruby, which claims to be able to parse all Ruby syntax correctly. It is used, for example, in the YARD Ruby documentation tool to, among other things, extract method names.

ruby_parser is another Ruby parser in Ruby. It is generated from parse.y via the racc parser generator that is part of Ruby's standard library.

YARV actually contains a parser library called ripper, which allows you to parse Ruby code. Unfortunately, it is completely undocumented, so you basically have to figure it out by reading blog posts. Except of course, being undocumented, almost nobody else has figured it out yet, either and written a blog post.

However, for your purposes, you don't actually need a full-blown Ruby parser. You only need enough to extract method names and some other stuff.

RDoc, the Ruby documentation generator, contains a Ruby parser which can parse just enough Ruby to, well, extract method names and some other stuff.

Cardinal is a Ruby implementation for the Parrot Virtual Machine. It does not yet run all of Ruby, but its parser should be powerful enough to support all you need. (The parser is written in the Parrot Grammar Engine, so you will obviously have to run it in Parrot, by for example writing your reporting tool in Perl6.)

tinyrb is another Ruby implementation that does not run full Ruby but contains a better written parser than YARV. In this case, the parser uses Ian Piumarta's leg Parsing Expression Grammar parser generator.

Jörg W Mittag 2010-04-15 16:41:55

+1 A:

The DMS Software Reengineering Toolkit has full, robust C# and Python parsers that automatically build complete ASTs. DMS offers facilities for walking the trees and collecting whatever data you might wish to collect.

Another poster's answer here suggests Ruby is really hard to parse. C++ is also famously hard to parse. DMS has been used to parse some 30 other languages, including full C++ in a number of dialects, so Ruby seems eminently doable. Howeever, DMS doesn't have an off-the-shelf parser for Ruby.

Ira Baxter 2010-04-16 02:49:36

C++ is definitely harder to parse than Ruby. But it has a specification. The reason that Ruby is hard to parse is that the syntax "specification" for Ruby is 10000 lines of C and YACC. (This is changing now with the ISO Ruby Specification, but that is only a very small subset of what you would normally call "Ruby", and wouldn't even be capable of parsing even the simplest real-world scripts.)

Jörg W Mittag 2010-04-17 00:00:10

So if "real" Ruby systems have only opaque (e.g. C/YACC implicit) specifications, how on earth does a Ruby programmer know what he can write? Or, what he might encounter, when reading the langauge? I have to say, this doesn't sound like a recommendation in favor of Ruby.

Ira Baxter 2010-04-17 05:33:57

ansaurus

tags:

views:

answers:

Language parsers

related questions