views:

175

answers:

1

I am parsing a file with python and pyparsing (it's the report file for PSAT in Matlab but that isn't important). here is what I have so far. I think it's a mess and would like some advice on how to improve it. Specifically, how should I organise my grammar definitions with pyparsing?

Should I have all my grammar definitions in one function? If so, it's going to be one huge function. If not, then how do I break it up. At the moment I have split it at the sections of the file. Is it worth making loads of functions that only ever get called once from one place. Neither really feels right to me.

Should I place all my input and output code in a separate file to the other class functions? It would make the purpose of the class much clearer.

I'm also interested to know if there is an easier way to parse a file, do some sanity checks and store the data in a class. I seem to spend a lot of my time doing this.

(I will accept answers of it's good enough or use X rather than pyparsing if people agree)

+1  A: 

I could go either way on using a single big method to create your parser vs. taking it in steps the way you have it now.

I can see that you have defined some useful helper utilities, such as slit ("suppress Literal", I presume), stringtolits, and decimaltable. This looks good to me.

I like that you are using results names, they really improve the robustness of your post-parsing code. I would recommend using the shortcut form that was added in pyparsing 1.4.7, in which you can replace

busname.setResultsName("bus1")

with

busname("bus1")

This can declutter your code quite a bit.

I would look back through your parse actions to see where you are using numeric indexes to access individual tokens, and go back and assign results names instead. Here is one case, where GetStats returns (ngroup + sgroup).setParseAction(self.process_stats). process_stats has references like:

self.num_load = tokens[0]["loads"]
self.num_generator = tokens[0]["generators"]
self.num_transformer = tokens[0]["transformers"]
self.num_line = tokens[0]["lines"]
self.num_bus = tokens[0]["buses"]
self.power_rate = tokens[1]["rate"]

I like that you have Group'ed the values and the stats, but go ahead and give them names, like "network" and "soln". Then you could write this parse action code as (I've also converted to the - to me - easier-to-read object-attribute notation instead of dict element notation):

self.num_load = tokens.network.loads
self.num_generator = tokens.network.generators
self.num_transformer = tokens.network.transformers
self.num_line = tokens.network.lines
self.num_bus = tokens.network.buses
self.power_rate = tokens.soln.rate

Also, a style question: why do you sometimes use the explicit And constructor, instead of using the '+' operator?

busdef = And([busname.setResultsName("bus1"),
            busname.setResultsName("bus2"),
            integer.setResultsName("linenum"),
            decimaltable("pf qf pl ql".split())])

This is just as easily written:

busdef = (busname("bus1") + busname("bus2") + 
            integer("linenum") + 
            decimaltable("pf qf pl ql".split()))

Overall, I think this is about par for a file of this complexity. I have a similar format (proprietary, unfortunately, so cannot be shared) in which I built the code in pieces similar to the way you have, but in one large method, something like this:

def parser():
    header = Group(...)
    inputsummary = Group(...)
    jobstats = Group(...)
    measurements = Group(...)
    return header("hdr") + inputsummary("inputs") + jobstats("stats") + measurements("meas")

The Group constructs are especially helpful in a large parser like this, to establish a sort of namespace for results names within each section of the parsed data.

Paul McGuire
Just the sort of answer I was looking for. The name shortcut will make things look nicer. object-attribute notation is new to me; does it work with all dicts? either way, thanks. I used and because when I wrote those parts I forgot about implied line continuation inside parentheses. I am thinking of having PSATReport_IO as a seperate class, there really isn't much other functions that are needed and it's looks too cluttered.
James Brooks
dd = dict(zip("a b c d".split(),range(4))); dd.a; dd["a"]No it doesn't work. Nice syntax though. I might have to try and implement it in a custom dict class.
James Brooks
No, object attribute-style referencing is a feature of pyparsing's ParseResults, not found in standard Python dicts. As an additional datapoint, my parser for the proprietary format is about 900 lines long.
Paul McGuire