Dear Overflowns:

I have an executable whose input is contained in an ASCII file with format:

$ GENERAL INPUTS
$ PARAM1 = 123.456
PARAM2=456,789,101112
PARAM3(1)=123,456,789
PARAM4       =
1234,5678,91011E2
PARAM5(1,2)='STRING','STRING2'
$ NEW INSTANCE
NEW(1)=.TRUE.
PAR1=123
[More data here]
$ NEW INSTANCE
NEW(2)=.TRUE.
[etcetera]

In other words, some general inputs, and some parameter values for a number of new instances. The declaration of parameters is irregular; some numbers are separated by commas, others are in scientific notation, others are inside quotes, the spacing is not constant, etc.

The evaluation of some scenarios requires that I take the input of one "master" data file and copy the parameter data of, say, instances 2 through 6 to another data file which may already contain data for said instances (in which case data should be overwritten) and possibly others (data which should be left unchanged).

I have written a Flex lexer and a Bison parser; together they can eat a data file and store the parameters in memory. If I use them to open both files (master and "scenario"), it should not be too hard to selectively write the desired parameters to a third, new file (as in "general input from 'scenario'; instances 1 through 5 from 'master'; instances 6 through 9 from 'scenario'; ..."), save it, and delete the original scenario file.

Other information: (1) the files are highly sensitive, it is very important that the user is completely shielded from altering the master file; (2) the files are of manageable size (from 500K to 10M).

I have learned that what I can do in ten lines of code, some fellow here can do in two. How would you approach this problem? A Pythonic answer would make me cry. Seriously.

+1  A: 

If you're already able to parse this format (I'd have tried it with pyparsing, but if you already have a working Flex/Bison solution, that will be just fine), and the parsed data fit well in memory, then you're basically there. You can represent what you read from each file as a simple object with a dict for "general input" and a list of dicts, one per instance (or probably better a dict of instances, with the keys being the instance numbers, which may give you a bit more flexibility). Then, as you mentioned, you just selectively "update" (add or overwrite) some of the instances copied from the master into the scenario, write the new scenario file, and replace the old one with it.

To use the Flex/Bison code with Python you have several options: make it into a DLL/so and access it with ctypes, or call it from a Cython-coded extension, a SWIG wrapper, a Python C-API extension, SIP, Boost, and so on.

Suppose that, one way or another, you have a parser primitive that (e.g.) accepts an input filename, reads and parses that file, and returns a list of 2-string tuples, each of which is either of the following:

  • (paramname, paramvalue)
  • ('$$$$', 'General Inputs')
  • ('$$$$', 'New Instance')

just using '$$$$' as a kind of arbitrary marker. Then for the object representing all that you've read from a file you might have:

import re

instidre = re.compile(r'NEW\((\d+)\)')

class Afile(object):

  def __init__(self, filename):
    self.filename = filename
    self.geninput = dict()
    self.instances = dict()

  def feed_data(self, listoftuples):
    it = iter(listoftuples)
    assert next(it) == ('$$$$', 'General Inputs')
    for name, value in it:
      if name == '$$$$': break
      self.geninput[name] = value
    else:  # no instances at all!
      return
    currinst = dict()
    for name, value in it:
      if name == '$$$$':
        self.finish_inst(currinst)
        currinst = dict()
        continue
      mo = instidre.match(name)
      if mo:
        assert value == '.TRUE.'
        name = '$$$INSTID$$$'
        value = mo.group(1)
      currinst[name] = value
    self.finish_inst(currinst)

  def finish_inst(self, adict):
    # pop from the instance dict itself (adict), not from the dict type
    instid = adict.pop('$$$INSTID$$$')
    assert instid not in self.instances
    self.instances[instid] = adict

Sanity checking might be improved a bit, diagnosing anomalies more precisely, but net of error cases I think this is roughly what you want.
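As one possible way to diagnose anomalies more precisely, here is a minimal sketch that checks the list of tuples before it is fed to the object and raises an exception naming the offending record instead of failing an assert. The names check_records and RecordError are made up for this illustration, and the rules it enforces (one NEW(n) line per instance, no duplicate ids) are my assumptions about the format, not something your executable is known to require.

```python
import re

MARKER = '$$$$'                          # same arbitrary marker as above
INSTID_RE = re.compile(r'NEW\((\d+)\)$')


class RecordError(ValueError):
    """Raised when the parsed records do not have the expected shape."""


def check_records(records):
    """Return the instance ids in file order, or raise RecordError."""
    if not records or records[0] != (MARKER, 'General Inputs'):
        raise RecordError("record 0: expected the 'General Inputs' marker")
    seen = []            # instance ids met so far
    current = None       # id of the instance being read, once known
    in_instance = False  # False while still in the general inputs
    for pos, (name, value) in enumerate(records[1:], start=1):
        if name == MARKER:
            if in_instance and current is None:
                raise RecordError(
                    'record %d: previous instance has no NEW(n) line' % pos)
            in_instance, current = True, None
            continue
        mo = INSTID_RE.match(name)
        if mo and in_instance:
            if value != '.TRUE.':
                raise RecordError(
                    'record %d: %s should be .TRUE., got %r' % (pos, name, value))
            if current is not None:
                raise RecordError(
                    'record %d: second NEW(n) line in one instance' % pos)
            if mo.group(1) in seen:
                raise RecordError(
                    'record %d: duplicate instance id %s' % (pos, mo.group(1)))
            current = mo.group(1)
            seen.append(current)
    if in_instance and current is None:
        raise RecordError('last instance has no NEW(n) line')
    return seen
```

Calling check_records on the parser output before feed_data would let you report a bad file to the user with a record position rather than a bare AssertionError.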

The merging just requires doing foo.instances[instid] = bar.instances[instid] for the required values of instid, where foo is the Afile instance for the scenario file and bar is the one for the master file -- that will overwrite or add as required.
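A minimal sketch of that merge, written against plain dicts so it can be tried on its own. The helper name merge_instances is made up here. It copies each instance rather than sharing the same dict object between the two files, which is an extra precaution I am assuming you would want given how sensitive the master data is. Note that the instance ids are strings, since they come from the regex match.

```python
import copy


def merge_instances(scenario, master, instids):
    """Copy the listed instances from master into scenario, in place.

    Both arguments map instance id -> dict of parameters, like the
    `instances` attribute above. Instances not listed are left untouched.
    """
    missing = [instid for instid in instids if instid not in master]
    if missing:
        raise KeyError('not in master: %s' % ', '.join(missing))
    for instid in instids:
        # deepcopy so later edits to the scenario cannot leak into master
        scenario[instid] = copy.deepcopy(master[instid])
    return scenario
```

For instances 2 through 6 the call would look something like merge_instances(foo.instances, bar.instances, [str(n) for n in range(2, 7)]).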

I'm assuming that to write out the newly changed scenario file you don't need to reproduce all the formatting quirks the specific inputs might have (if you do, then those quirks will need to be recorded during parsing, together with names and values). In that case it should suffice to write the general inputs in sorted order, then loop on sorted(foo.instances) and write each instance's parameters, also in sorted order, with the appropriate $ marker lines and with the instance id translated back into its NEW(n)=.TRUE. line.
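Here is a rough sketch of that writing step, under the same assumption that the executable does not care about ordering or spacing. The function names format_scenario and replace_file are made up, and the marker lines are simply copied from the sample in the question. The ids are sorted numerically because they are stored as strings. The new text goes to a temporary file in the same directory and is then moved over the old scenario with os.replace, so a crash midway should not leave a half-written scenario file; the master file is never opened for writing.

```python
import os
import tempfile


def format_scenario(geninput, instances):
    """Render general inputs and instances as the text of a data file."""
    lines = ['$ GENERAL INPUTS']
    lines += ['%s=%s' % (name, geninput[name]) for name in sorted(geninput)]
    # ids are strings, so sort them as integers ('2' before '10')
    for instid in sorted(instances, key=int):
        lines.append('$ NEW INSTANCE')
        lines.append('NEW(%s)=.TRUE.' % instid)
        inst = instances[instid]
        lines += ['%s=%s' % (name, inst[name]) for name in sorted(inst)]
    return '\n'.join(lines) + '\n'


def replace_file(path, text):
    """Write text to a temp file next to path, then move it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmpname = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as out:
            out.write(text)
        os.replace(tmpname, path)   # overwrites the old scenario file
    except Exception:
        os.unlink(tmpname)          # do not leave the temp file behind
        raise
```

With the objects above, the last step would be along the lines of replace_file(foo.filename, format_scenario(foo.geninput, foo.instances)).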

Alex Martelli
Darn! I wanted to see your pyparsing solution!
Paul McGuire
I like your error checking (I did not think of asserting that the parameters within an instance corresponded to a .TRUE. value). For the time being I will stick to my Flex/Bison parser and will implement this Python part for the writing of the new file and sanity checks. Thanks!
Arrieta
@Arrieta, yes, I did suggest sticking to Flex/Bison, since you have it working -- just interfacing it to Python for the "logical" (as opposed to parsing) part of the job.
Alex Martelli