ansaurus

Question

Resources for lexing, tokenising and parsing in Python

Answer 1

+2 A:

Have a look at the standard module shlex and modify a copy of it to match the syntax you use for your shell; it is a good starting point.
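Before modifying a copy, it may be worth checking how far the stock module gets you. A minimal sketch, assuming a shell-like syntax with double-quoted arguments and `#` comments (the `tokenize` helper and the sample command are made up for illustration):

```python
import shlex

def tokenize(line):
    # comments=True makes shlex drop everything after an unquoted '#'.
    return shlex.split(line, comments=True)

tokens = tokenize('copy "my file.txt" /tmp/out  # trailing comment')
print(tokens)
```

If your syntax differs from POSIX shell rules (other quote characters, other comment markers), that is the point where subclassing or editing a copy of shlex starts to pay off.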

If you want all the power of a complete solution for lexing/parsing, ANTLR can generate Python parsers too.

PW 2008-08-31 17:14:06

Answer 2

+2 A:

I suggest http://www.canonware.com/Parsing/, since it is pure Python and you don't need to learn a grammar, but it isn't widely used and has comparatively little documentation. The heavyweights are ANTLR and PyParsing. ANTLR can generate Java and C++ parsers too, as well as AST walkers, but you will have to learn what amounts to a new language.

nt 2008-08-31 23:14:54

Answer 3

+6 A:

I'm a happy user of PLY. It is a pure-Python implementation of Lex & Yacc, with lots of small niceties that make it quite Pythonic and easy to use. Since Lex & Yacc are the most popular lexing & parsing tools and are used in the most projects, PLY has the advantage of standing on giants' shoulders. A lot of knowledge exists online on Lex & Yacc, and you can freely apply it to PLY.

PLY also has a good documentation page with some simple examples to get you started.
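If Lex is new to you, the core idea is a table of named regular-expression rules that turns text into a token stream. The sketch below illustrates that idea with the standard `re` module only; it is not PLY's API, and the rule names and the `tokenize` helper are assumptions chosen for the example.

```python
import re

# Hypothetical token rules as (name, regex) pairs, tried in order, lex-style.
TOKEN_RULES = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[-+*/=()]"),
    ("SKIP",   r"\s+"),
    ("ERROR",  r"."),   # catch-all so bad input is reported, not skipped
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_RULES))

def tokenize(text):
    """Yield (kind, value) pairs; whitespace is dropped, bad input raises."""
    for match in MASTER.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {value!r} at {match.start()}")
        yield kind, value

print(list(tokenize("x = 3.5 * (y + 2)")))
```

In PLY the equivalent rules are written as module-level definitions that the library collects for you, and the Yacc half then consumes the resulting tokens.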

For a listing of lots of Python parsing tools, see this.

Eli Bendersky 2008-09-20 05:07:57

I second the recommendation for PLY, it's great.

mipadi 2008-11-11 01:46:09

Answer 4

+1 A:

Pygments is a source code syntax highlighter written in Python. It has lexers and formatters, and its source may be interesting to peek at.

nilamo 2008-09-20 05:15:57

Answer 5

+5 A:

For medium-complex grammars, PyParsing is brilliant. You can define grammars directly within Python code, with no need for code generation:

>>> from pyparsing import Word, alphas
>>> greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
>>> hello = "Hello, World!"
>>> print(hello, "->", greet.parseString(hello))
Hello, World! -> ['Hello', ',', 'World', '!']

(Example taken from the PyParsing home page).

With parse actions (functions that are invoked when a certain grammar rule is triggered), you can convert parses directly into abstract syntax trees, or any other representation.

There are many helper functions that encapsulate recurring patterns, like operator hierarchies, quoted strings, nesting or C-style comments.

Torsten Marek 2008-09-26 01:05:35

For what it's worth, I've always had trouble with PyParsing. I've tried to use it a few times and never been fully satisfied with the result (e.g., it's taken a long time, been hard to debug, required more code than I expected, etc.). I can't say if this is due to my ignorance or a failing in PyParsing, though…

David Wolever 2010-09-10 17:45:45

Answer 6

+4 A:

Here are a few things to get you started (roughly from simplest to most complex, least to most powerful):

http://en.wikipedia.org/wiki/Recursive_descent_parser

http://en.wikipedia.org/wiki/Top-down_parsing

http://en.wikipedia.org/wiki/LL_parser

http://effbot.org/zone/simple-top-down-parsing.htm

http://en.wikipedia.org/wiki/Bottom-up_parsing

http://en.wikipedia.org/wiki/LR_parser

http://en.wikipedia.org/wiki/GLR_parser

When I learned this stuff, it was in a semester-long 400-level university course. We did a number of assignments where we did parsing by hand; if you want to really understand what's going on under the hood, I'd recommend the same approach.
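To give a feel for what parsing by hand looks like, here is a small recursive descent parser that evaluates arithmetic expressions, with one method per grammar rule. It is only a sketch: the grammar, the `Parser` class and the `evaluate` helper are illustrative choices, not taken from the linked articles or from any course.

```python
import re

def tokenize(text):
    # Numbers, or any single non-space character (operators, parentheses).
    return re.findall(r"\d+(?:\.\d+)?|\S", text)

class Parser:
    """Grammar (illustrative):
        expr   := term (('+' | '-') term)*
        term   := factor (('*' | '/') factor)*
        factor := NUMBER | '(' expr ')' | '-' factor
    """
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        tok = self.advance()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok == "-":
            return -self.factor()
        try:
            return float(tok)
        except ValueError:
            raise SyntaxError(f"unexpected token {tok!r}") from None

def evaluate(text):
    parser = Parser(tokenize(text))
    result = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return result

print(evaluate("2 + 3 * (4 - 1)"))  # 11.0
```

Operator precedence falls out of the rule nesting: `term` binds tighter than `expr` because `expr` calls it, which is the main thing to notice before moving on to table-driven LL and LR parsers.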

This isn't the book I used, but it's pretty good: Principles of Compiler Design.

Hopefully that's enough to get you started :)

Tony Arkles 2008-11-11 01:13:42
