I have a fairly big custom-made config file that I need to extract data from once a week. It is an in-house format which doesn't comply with any known standard such as INI.

My quick and dirty approach was to use re to search for the section header I want and then extract the one or two lines of information I need under that header. This is proving quite a challenge, and I suspect there must be an easier, more reliable way of doing it. However, I keep thinking that I would need to implement a full parser for this file just to extract the five lines of data I need.

The "sections" look something like this:

Registry com.name.version =
Registry "unique-name I search for using re" =
    String name = "modulename";
    String timestamp = "not specified";
    String java = "not specified";
    String user = "not specified";
    String host = "not specified";
    String system = "not specified";
    String version = "This I want";
    String "version-major" = "not specified";
    String "version-minor" = "not specified";
    String scm = "not specified";
    String scmrevision = "not specified";
    String mode = "release";
    String teamCityBuildNumber = "not specified";
;
A: 

Regular expressions have no state, so you can't use them on their own to parse complex input. But you can load the file into a string, use a regexp to find a substring, and then cut the string at that place.

In your case, search for r'unique-name I search for using re"\s*=\s*', then cut after the match. Then search for r'\n\s*;\s*\n' and cut before the match. This leaves you with the values which you can chop using another regexp.
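The two cuts described above could be sketched roughly as follows. This is only an illustration, not a tested solution for the real file: the helper names `extract_section` and `parse_values` are made up here, the sample text is a shortened copy of the data from the question, and the pattern used to chop the values assumes every setting line looks like the ones shown there.

```python
import re

# Shortened sample copied from the question; in practice this would be
# the contents of the config file read into a string.
text = '''Registry com.name.version =
Registry "unique-name I search for using re" =
    String name = "modulename";
    String version = "This I want";
    String "version-major" = "not specified";
    String mode = "release";
;
'''

def extract_section(text, name):
    """Return the body of the named Registry section, or None if not found."""
    # Step 1: find the section header and cut after the match.
    header = re.search(r'Registry\s+"' + re.escape(name) + r'"\s*=\s*', text)
    if header is None:
        return None
    rest = text[header.end():]
    # Step 2: find the lone ';' line that closes the section and cut before it.
    end = re.search(r'\n\s*;\s*(?:\n|$)', rest)
    return rest if end is None else rest[:end.start()]

def parse_values(body):
    """Chop the section body into a {key: value} dict (hypothetical helper)."""
    pairs = re.findall(r'^\s*\w+\s+"?([^"=]+?)"?\s*=\s*"([^"]*)";\s*$', body, re.M)
    return dict(pairs)

body = extract_section(text, "unique-name I search for using re")
values = parse_values(body)
print(values["version"])
```

If the header is missing, `extract_section` returns None, so the caller can tell "section not found" apart from an empty section.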

Aaron Digulla
This sounds reasonable, I'll look a bit more into this approach, thanks Aaron.
NomadAlien
A: 

If you only look for special content, using regexp is fine; if you need to read everything, you should rather build yourself a parser.

>>> import re
>>> s = ''' ... ''' # as above
>>> t = re.search( r'Registry "unique-name" =(.*?)\n;', s, re.S ).group( 1 )
>>> u = re.findall( r'^\s*(\w+) "?(.*?)"? = "(.*?)";\s*$', t, re.M )
>>> for x in u:
...     print( x )

('String', 'name', 'modulename')
('String', 'timestamp', 'not specified')
('String', 'java', 'not specified')
('String', 'user', 'not specified')
('String', 'host', 'not specified')
('String', 'system', 'not specified')
('String', 'version', 'This I want')
('String', 'version-major', 'not specified')
('String', 'version-minor', 'not specified')
('String', 'scm', 'not specified')
('String', 'scmrevision', 'not specified')
('String', 'mode', 'release')

edit: Although the above version should work for multiple Registry sections, here is a stricter version:

t = re.search( r'Registry "unique-name"\s*=\s*((?:\s*\w+ "?[^"=]+"?\s*=\s*"[^"]*?";\s*)+)\s*;', s ).group( 1 )
u = re.findall( r'^\s*(\w+) "?([^"=]+?)"?\s*=\s*"([^"]*?)";\s*$', t, re.M )
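To check that the stricter expression really stops at the end of the wanted section, one could try it on input with two sections. The snippet below is a sketch under assumptions: the second section ("other-name") is invented for the test, and the key group is written as non-greedy so that unquoted keys come back without a trailing space.

```python
import re

# Two sections in the question's format; the second one is made up for this test.
s = '''Registry "unique-name" =
    String name = "modulename";
    String version = "This I want";
;
Registry "other-name" =
    String name = "othermodule";
    String version = "not this one";
;
'''

# Header, then one or more setting lines, then the closing ';'.
section = re.compile(
    r'Registry "unique-name"\s*=\s*'
    r'((?:\s*\w+ "?[^"=]+"?\s*=\s*"[^"]*?";\s*)+)\s*;'
)
# One setting line: type, optionally quoted key, quoted value.
setting = re.compile(r'^\s*(\w+) "?([^"=]+?)"?\s*=\s*"([^"]*?)";\s*$', re.M)

t = section.search(s).group(1)
# Fold the (type, key, value) tuples into a dict keyed by setting name.
settings = {key: value for _type, key, value in setting.findall(t)}
print(settings)
```

Because the repeated group only accepts setting lines, the match cannot run past the `;` that closes the first section into the next Registry header.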
poke
Ugh! This just shows how much I still have to learn about regexps! Thanks poke, this is very helpful and educational :-)
NomadAlien
poke, I'm having some trouble understanding your whole regexp. Your example seems to find the correct section, but it doesn't stop at the end of the section when it reaches the ";", so it keeps reading ALL the sections after it. Is it possible to extract only this one whole section?
NomadAlien
It worked for me (I added another *Registry* section in my test), but just to be safe, I have updated my post to include a stricter expression that additionally makes sure only setting lines follow the header.
poke
Fantastic! This works beautifully! Thanks poke!!
NomadAlien
A: 

I think you should create a simple parser that builds a dictionary of sections, each holding a dictionary of keys. Something like:

#!/usr/bin/python

import re

# raw strings keep the regex backslashes from being treated as string escapes
re_section = re.compile(r'Registry (.*)=', re.IGNORECASE)
re_value = re.compile(r'\s+String\s+(\S+)\s*=\s*(.*);')

txt = '''
Registry com.name.version =
Registry "unique-name I search for using re" =
        String name = "modulename";
        String timestamp = "not specified";
        String java = "not specified";
        String user = "not specified";
        String host = "not specified";
        String system = "not specified";
        String version = "This I want";
        String "version-major" = "not specified";
        String "version-minor" = "not specified";
        String scm = "not specified";
        String scmrevision = "not specified";
        String mode = "release";
        String teamCityBuildNumber = "not specified";
'''

my_config = {}
section = ''
lines = txt.split('\n')
for l in lines:
    rx = re_section.search(l)
    if rx:
        section = rx.group(1)
        section = section.strip('" ')
        continue
    rx = re_value.search(l)
    if rx:
        (k, v) = (rx.group(1).strip('" '), rx.group(2).strip('" '))
        try:
            my_config[section][k] = v
        except KeyError:
            my_config[section] = {k: v}

Then if you:

print(my_config["unique-name I search for using re"]['version'])

it will output:

This I want
Michał Niklas
Nice!....Thanks Michal!
NomadAlien
A: 

A simple parser using pyparsing can give you something close to a deserializer, that would let you access fields by key name (like in a dict), or as attributes. Here is the parser:

from pyparsing import (Suppress,quotedString,removeQuotes,Word,alphas,
        alphanums, printables,delimitedList,Group,Dict,ZeroOrMore,OneOrMore)

# define punctuation and constants - suppress from parsed output
EQ,SEMI = map(Suppress,"=;")
REGISTRY = Suppress("Registry")
STRING = Suppress("String")

# define some basic building blocks
quotedString.setParseAction(removeQuotes)
ident = quotedString | Word(printables)
value = quotedString
java_path = delimitedList(Word(alphas,alphanums+"_"), '.', combine=True)

# define the config file sections
string_defn = Group(STRING + ident + EQ + value + SEMI)
registry_section = Group(REGISTRY + ident + EQ + Dict(ZeroOrMore(string_defn)))

# special definition for leading java module
java_module = REGISTRY + java_path("path") + EQ

# define the overall config file format
config = java_module("java") + Dict(OneOrMore(registry_section))

Here is a test using your data (read from your data file into config_source):

data = config.parseString(config_source)
print(data.dump())
print(data["unique-name I search for using re"].version)
print(data["unique-name I search for using re"].mode)
print(data["unique-name I search for using re"]["version-major"])

Prints:

['com.name.version', ['unique-name I search for using re', ...
- java: ['com.name.version']
  - path: com.name.version
- path: com.name.version
- unique-name I search for using re: [['name', 'modulename'], ...
  - host: not specified
  - java: not specified
  - mode: release
  - name: modulename
  - scm: not specified
  - scmrevision: not specified
  - system: not specified
  - teamCityBuildNumber: not specified
  - timestamp: not specified
  - user: not specified
  - version: This I want
  - version-major: not specified
  - version-minor: not specified
This I want
release
not specified
Paul McGuire