I have a huge grammar developed for pyparsing as part of a large, pure python application. I have reached the limit of performance tweaking and I'm at the point where the diminishing returns make me start to look elsewhere. Yes, I think I know most of the tips and tricks and I've profiled my grammar and my application to dust.

What next?

I hope to find a parser that gives me the same readability, usability (I'm using many advanced features of pyparsing, such as parse actions to start post-processing of the input while it is being parsed) and Python integration, but at 10X the performance.

I love the fact that the grammar is pure Python.

All my basic blocks are regular expressions, so reusing them would be nice.
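
For context, here is a minimal sketch of the kind of grammar being described: regex terminals plus a parse action that starts post-processing during the parse. The grammar itself is hypothetical, not taken from the actual application.

    from pyparsing import Regex, Suppress, delimitedList

    # Regex-based building blocks, as described above
    integer = Regex(r"[+-]?\d+").setParseAction(lambda t: int(t[0]))
    ident = Regex(r"[A-Za-z_]\w*")

    assignment = ident("name") + Suppress("=") + delimitedList(integer)("nums")

    def handle_assignment(tokens):
        # parse action: post-processing starts as soon as each assignment matches
        print("%s -> %d" % (tokens.name, sum(tokens.nums)))

    assignment.setParseAction(handle_assignment)
    assignment.parseString("total = 1, 2, 3")   # prints: total -> 6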

I know I can't have everything, so I am willing to give up some of the features I have today to get the requested 10X performance.

Where do I go from here?

+3  A: 

It looks like the pyparsing folks have anticipated your problem. From http://pyparsing.wikispaces.com/HowToUsePyparsing :

Performance of pyparsing may be slow for complex grammars and/or large input strings. The psyco package can be used to improve the speed of the pyparsing module with no changes to grammar or program logic - observed improvements have been in the 20-50% range.
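
For reference, enabling psyco is only a couple of lines at program startup; a minimal sketch (guarded, since psyco is optional and CPython-2-only):

    try:
        import psyco
        psyco.full()   # JIT-specializes everything, including pyparsing's internals
    except ImportError:
        pass           # psyco not available; run on plain CPython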

If you're really a speed demon, but want to keep some of the legibility and declarative syntax, I'd suggest having a look at ANTLR. Probably not the Python-generating backend; I'm skeptical whether that's mature or high-performance enough for your needs. I'm talking about the goods: the C backend that started it all.

Wrap a Python C extension module around the entry point to the parser, and turn it loose.
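
If hand-writing a full C-API extension up front is too big a first step, one lower-effort variant (a sketch only, not the suggestion above verbatim) is to expose the generated parser's entry point from a shared library and call it through ctypes; the library name and function signature here are hypothetical:

    import ctypes

    # Hypothetical: libmygrammar.so wraps the ANTLR-generated C parser and
    # exposes one entry point that parses a string and returns a serialized
    # result (the real ANTLR C runtime API is considerably more involved).
    lib = ctypes.CDLL("./libmygrammar.so")
    lib.parse_input.argtypes = [ctypes.c_char_p]
    lib.parse_input.restype = ctypes.c_char_p

    def parse(text):
        result = lib.parse_input(text.encode("utf-8"))
        return result.decode("utf-8") if result is not None else None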

Having said that, you'll be giving up a lot in this transition: basically any Python you want to do in your parser will have to be done through the C API (not altogether pretty). Also, you'll have to get used to very different ways of doing things. ANTLR has its charms, but it's not based on combinators, so there's not the easy and fluid relationship between your grammar and your language that there is in pyparsing. Plus, it's its own DSL, much like lex/yacc, which can present a learning curve – but, because it's LL based, you'll probably find it easier to adapt to your needs.

Owen S.
A: 

There's no way to know what kind of benefit you'll get without actually testing it, but it's within the range of possibility that you could get a 10x benefit just from using Unladen Swallow, if your process is long-running and repetitive. (Also, if you have many inputs to parse and you typically start a new interpreter for each one, note that Unladen Swallow gets faster, up to a point, the longer your process runs; parsing one input might not show much gain, but you could see significant gains on the 2nd and 3rd inputs in the same process.)

(Note: pull the latest from SVN; you'll get far better performance than with the latest tarball.)

Nick Bastin
Nick - I started reading about US, and while installing, compiling and building I came across the PyCon 2010 presentation benchmarks. I don't see any benchmark with even a 2X gain over CPython 2.6.4! Why do you expect anything better? That said, this is the easiest option, so I might as well just try it...
Tal Weiss
@Tal: My own personal experience, really (I was able to get a 3.5-4x speed improvement on some parsing code). The US benchmarks are real-world benchmarks, which is useful, but they miss the benefits of retooling your code to get more out of US - specifically, by creating a long-running process instead of a bunch of short-lived processes to do your work (see the sketch after these comments). When I did my parsing tests, the difference in parsing one file was marginal - maybe 5-10% faster - but by the time the 10th file was fed through the same parser process, it was running almost 400% faster.
Nick Bastin
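
Following up on the comments above, here is a minimal sketch of the retooling being described: build the grammar once and feed many inputs through the same long-lived process, so the JIT has time to warm up. The grammar and file handling below are placeholders.

    import sys
    from pyparsing import Regex

    def build_grammar():
        # placeholder for the real, expensive-to-build grammar
        return Regex(r"\w+")

    def main(paths):
        grammar = build_grammar()            # built once per process
        for path in paths:                   # many inputs, one warmed-up interpreter
            text = open(path).read()
            for tokens, start, end in grammar.scanString(text):
                pass                         # real post-processing goes here

    if __name__ == "__main__":
        main(sys.argv[1:])
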
A: 

Switch to a generated C/C++ parser (using ANTLR, flex/bison, etc.). If you can delay all the action rules until after you are done parsing, you might be able to build an AST with trivial code, pass that back to your Python code via something like SWIG, and process it with your current action rules. OTOH, for that to give you a speed boost, the parsing has to be the heavy lifting. If your action rules are the big cost, then this will buy you nothing unless you write your action rules in C as well (but you might have to do that anyway to avoid paying for whatever impedance mismatch you get between the Python and C code).
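
A rough sketch of what the Python side of that split might look like; the fastparser module name and the AST node shape are hypothetical stand-ins for whatever a SWIG wrapper would actually expose:

    import fastparser   # hypothetical SWIG-generated wrapper around the C/C++ parser

    def handle_assignment(node):
        # an existing action rule, now run after parsing instead of during it
        process_assignment(node.text)        # hypothetical post-processing hook

    ACTIONS = {"assignment": handle_assignment}

    def apply_actions(node):
        # walk the C-built AST and dispatch to the existing Python action rules
        handler = ACTIONS.get(node.kind)
        if handler is not None:
            handler(node)
        for child in node.children:
            apply_actions(child)

    tree = fastparser.parse(open("input.txt").read())   # heavy lifting done in C
    apply_actions(tree)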

BCS
A: 

If you really want performance for large grammars, look no further than SimpleParse (which itself relies on mxTextTools, a C extension). However, know that it comes at the cost of being more cryptic and requiring that you be well-versed in EBNF.

It's definitely not the most Pythonic route, and you're going to have to start over with an EBNF grammar to use SimpleParse.
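
For a feel of the trade-off, here is a minimal sketch of what a SimpleParse grammar looks like; the EBNF is illustrative only, so check the SimpleParse docs for the exact dialect and API details:

    from simpleparse.parser import Parser

    # SimpleParse grammars are written as EBNF text, not composed from
    # Python objects the way pyparsing grammars are.
    declaration = r'''
    assignment := name, '=', value
    name       := [a-zA-Z_], [a-zA-Z0-9_]*
    value      := [0-9]+
    '''

    parser = Parser(declaration, "assignment")
    success, children, next_char = parser.parse("total=42")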

jathanism