views:

732

answers:

8

I have a config file that is in the following form:

protocol sample_thread {
    { AUTOSTART 0 }
    { BITMAP thread.gif }
    { COORDS {0 0} }
    { DATAFORMAT {
        { TYPE hl7 }
        { PREPROCS {
            { ARGS {{}} }
            { PROCS sample_proc }
        } }
    } } 
}

The real file may not have these exact fields, and I'd rather not have to describe the the structure of the data is to the parser before it parses.

I've looked for other configuration file parsers, but none that I've found seem to be able to accept a file of this syntax.

I'm looking for a module that can parse a file like this, any suggestions?

If anyone is curious, the file in question was generated by Quovadx Cloverleaf.

A: 

Maybe you could write a simple script that will convert your config into xml file and then read it just using lxml, Beatuful Soup or anything else? And your converter could use PyParsing or regular expressions for example.

paffnucy
So, you're suggesting he parses the script, generates an XML, and then parses the XML file? Am I the only one that sees a problem with this aproach ?
Geo
Yes he should create a general parser that takes this script, along with a description of the format in xml - and convert the output into xml ;-)
Martin Beckett
why? because he can easily convert it to xml (with using a simple stack), and then it'll be easier for him to use xml parser - there're a lot of good and simple tutorials on this, and Benjamin seems to be a newbie in python. I mean xml parser is easier than custom format parser.
paffnucy
+1  A: 

I'll try and answer what I think is the missing question(s)...

Configuration files come in many formats. There are well known formats such as *.ini or apache config - these tend to have many parsers available.

Then there are custom formats. That is what yours appears to be (it could be some well-defined format you and I have never seen before - but until you know what that is it doesn't really matter).

I would start with the software this came from and see if they have a programming API that can load/produce these files. If nothing is obvious give Quovadx a call. Chances are someone has already solved this problem.

Otherwise you're probably on your own to create your own parser.

Writing a parser for this format would not be terribly difficult assuming that your sample is representative of a complete example. It's a hierarchy of values where each node can contain either a value or a child hierarchy of values. Once you've defined the basic types that the values can contain the parser is a very simple structure.

You could write this reasonably quickly using something like Lex/Flex or just a straight-forward parser in the language of your choosing.

Bubbafat
A: 

I searched a little on the Cheese Shop, but I didn't find anything helpful for your example. Check the Examples page, and this specific parser ( it's syntax resembles yours a bit ). I think this should help you write your own.

Geo
A: 

Look into LEX and YACC. A bit of a learning curve, but they can generate parsers for any language.

Erika
+1  A: 

You can easily write a script in python which will convert it to python dict, format looks almost like hierarchical name value pairs, only problem seems to be Coards {0 0}, where {0 0} isn't a name value pair, but a list so who know what other such cases are in the format I think your best bet is to have spec for that format and write a simple python script to read it.

Anurag Uniyal
+10  A: 

pyparsing is pretty handy for quick and simple parsing like this. A bare minimum would be something like:

import pyparsing
string = pyparsing.CharsNotIn("{} \t\r\n")
group = pyparsing.Forward()
group << pyparsing.Group(pyparsing.Literal("{").suppress() + 
                         pyparsing.ZeroOrMore(group) + 
                         pyparsing.Literal("}").suppress()) 
        | string

toplevel = pyparsing.OneOrMore(group)

The use it as:

>>> toplevel.parseString(text)
['protocol', 'sample_thread', [['AUTOSTART', '0'], ['BITMAP', 'thread.gif'], 
['COORDS', ['0', '0']], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', 
[['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]

From there you can get more sophisticated as you want (parse numbers seperately from strings, look for specific field names etc). The above is pretty general, just looking for strings (defined as any non-whitespace character except "{" and "}") and {} delimited lists of strings.

Brian
+1 for pyparsing
Nadia Alramli
+1  A: 

Your config file is very similar to JSON (pretty much, replace all your "{" and "}" with "[" and "]"). Most languages have a built in JSON parser (PHP, Ruby, Python, etc), and if not, there are libraries available to handle it for you.

If you can not change the format of the configuration file, you can read all file contents as a string, and replace all the "{" and "}" characters via whatever means you prefer. Then you can parse the string as JSON, and you're set.

Mike Trpcic
not really - json needs commas between values and quotes around strings.
nosklo
A: 

Taking Brian's pyparsing solution another step, you can create a quasi-deserializer for this format by using the Dict class:

import pyparsing

string = pyparsing.CharsNotIn("{} \t\r\n")
# use Word instead of CharsNotIn, to do whitespace skipping
stringchars = pyparsing.printables.replace("{","").replace("}","")
string = pyparsing.Word( stringchars )
# define a simple integer, plus auto-converting parse action
integer = pyparsing.Word("0123456789").setParseAction(lambda t : int(t[0]))
group = pyparsing.Forward()
group << ( pyparsing.Group(pyparsing.Literal("{").suppress() +
    pyparsing.ZeroOrMore(group) +
    pyparsing.Literal("}").suppress())
    | integer | string )

toplevel = pyparsing.OneOrMore(group)

sample = """
protocol sample_thread {
    { AUTOSTART 0 }
    { BITMAP thread.gif }
    { COORDS {0 0} }
    { DATAFORMAT {
        { TYPE hl7 }
        { PREPROCS {
            { ARGS {{}} }
            { PROCS sample_proc }
        } }
    } } 
    }
"""

print toplevel.parseString(sample).asList()

# Now define something a little more meaningful for a protocol structure, 
# and use Dict to auto-assign results names
LBRACE,RBRACE = map(pyparsing.Suppress,"{}")
protocol = ( pyparsing.Keyword("protocol") + 
             string("name") + 
             LBRACE + 
             pyparsing.Dict(pyparsing.OneOrMore(
                pyparsing.Group(LBRACE + string + group + RBRACE)
                ) )("parameters") + 
             RBRACE )

results = protocol.parseString(sample)
print results.name
print results.parameters.BITMAP
print results.parameters.keys()
print results.dump()

Prints

['protocol', 'sample_thread', [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS', 

[0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]
sample_thread
thread.gif
['DATAFORMAT', 'COORDS', 'AUTOSTART', 'BITMAP']
['protocol', 'sample_thread', [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS', [0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]
- name: sample_thread
- parameters: [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS', [0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]
  - AUTOSTART: 0
  - BITMAP: thread.gif
  - COORDS: [0, 0]
  - DATAFORMAT: [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]

I think you will get further faster with pyparsing.

-- Paul

Paul McGuire